(12) INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT) 



(19) World Intellectual Property
Organization
International Bureau

(43) International Publication Date
22 July 2004 (22.07.2004) 







(10) International Publication Number 

WO 2004/061616 A2



(51) International Patent Classification:



G06F 



(21) International Application Number: 

PCT/US2003/041613

(22) International Filing Date: 

24 December 2003 (24.12.2003) 



(25) Filing Language: English
(26) Publication Language: English



(30) Priority Data: 

60/436,684 27 December 2002 (27.12.2002) US 

60/460,343 2 April 2003 (02.04.2003) US 

(71) Applicant (for all designated States except US): 
ROSETTA INPHARMATICS LLC [US/US]; 12040
115th Avenue, N.E., Kirkland, WA 98034 (US).



(72) Inventors; and 

(75) Inventors/Applicants (for US only): SCHADT, Eric 
[US/US]; 1517 3rd Place, Kirkland, WA 98033 (US). 
MONKS, Stephanie, A. [US/US]; 906 N.E. 122nd Street,
Seattle, WA 98125 (US).

(74) Agents: ANTLER, Adriane, M. et al.; Pennie & Edmonds
LLP, 1155 Avenue of the Americas, New York, NY 10036
(US).

(81) Designated States (national): AE, AG, AL, AM, AT, AU,
AZ, BA, BB, BG, BR, BY, BZ, CA, CH, CN, CO, CR, CU,
CZ, DE, DK, DM, DZ, EC, EE, EG, ES, FI, GB, GD, GE,
GH, GM, HR, HU, ID, IL, IN, IS, JP, KE, KG, KP, KR,
KZ, LC, LK, LR, LS, LT, LU, LV, MA, MD, MG, MK,
MN, MW, MX, MZ, NI, NO, NZ, OM, PG, PH, PL, PT,
RO, RU, SC, SD, SE, SG, SK, SL, SY, TJ, TM, TN, TR,
TT, TZ, UA, UG, US, UZ, VC, VN, YU, ZA, ZM, ZW.

(84) Designated States (regional): ARIPO patent (BW, GH,
GM, KE, LS, MW, MZ, SD, SL, SZ, TZ, UG, ZM, ZW),

[ Continued on next page] 



(54) Title: COMPUTER SYSTEMS AND METHODS FOR ASSOCIATING GENES WITH TRAITS USING CROSS SPECIES 
DATA 



[Figure: block diagram of a computer system (10) comprising a CPU, a user
interface, a network interface card (NIC), and a memory. The memory holds an
operating system; a file system; gene expression / cellular constituent data
(for each of organisms 1 through Q, genes / cellular constituents 1 through N,
each with an intensity or cellular constituent level, an optional background
signal, and an optional gene probe annotation); genotype and pedigree data;
marker data; a normalization module; a marker map construction module; an
expression / genotype warehouse; a genetic marker map; a genetic analysis
module; a QTL results database (QTL 1 through M, each with a position and a
statistical score); a clustering module; a cluster database; a multivariate
QTL analysis module; and phenotypic data.]



(57) Abstract: A method for confirming the association of a 
query QTL or a query gene in the genome of a second species 
with a clinical trait T exhibited by the second species. A first 
QTL or a first gene in a first species that is linked to a trait T' is
found. The trait T' is indicative of trait T. A region of the genome
of the first species that comprises the first QTL or the first gene 
is mapped to a particular region of the genome of the second 
species. A query QTL or a query gene in the second species that 
is potentially associated with the trait T is found. The potential 
association of the query QTL or the query gene with the clinical 
trait T is confirmed when the query QTL or the query gene is in 
the particular region of the genome of the second species.



Eurasian patent (AM, AZ, BY, KG, KZ, MD, RU, TJ, TM),
European patent (AT, BE, BG, CH, CY, CZ, DE, DK, EE,
ES, FI, FR, GB, GR, HU, IE, IT, LU, MC, NL, PT, RO, SE,
SI, SK, TR), OAPI patent (BF, BJ, CF, CG, CI, CM, GA,
GN, GQ, GW, ML, MR, NE, SN, TD, TG).

Published:

— without international search report and to be republished
upon receipt of that report

For two-letter codes and other abbreviations, refer to the "Guidance
Notes on Codes and Abbreviations" appearing at the beginning of each
regular issue of the PCT Gazette.






WO 2004/061616 PCT/US2003/041613

COMPUTER SYSTEMS AND METHODS FOR ASSOCIATING GENES WITH 

TRAITS USING CROSS SPECIES DATA 

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims benefit, under 35 U.S.C. § 119(e), of U.S. Provisional
Patent Application No. 60/436,684, filed on December 27, 2002, which is incorporated
herein, by reference, in its entirety. This application also claims benefit, under 35 U.S.C.
§ 119(e), of U.S. Provisional Patent Application No. 60/460,343, filed April 2, 2003,
which is incorporated herein, by reference, in its entirety.

1. FIELD OF THE INVENTION 
The field of this invention relates to computer systems and methods for
identifying genes and biological pathways associated with traits. In particular, this
invention relates to computer systems and methods for using both gene expression data
and genetic data to identify gene-gene interactions, gene-phenotype interactions, and
biological pathways linked to traits in one species using data from another species.

2. BACKGROUND OF THE INVENTION 

A variety of approaches have been taken to identify genes and pathways that are
associated with traits, such as human disease. In one approach, attempts have been made
to use gene expression data to identify genes and pathways associated with such traits. In
another approach, genetic information has been used to attempt to identify genes and
pathways associated with traits. For instance, clinical measures of a population may be
taken to study a trait such as a disease found in the population. Risk factors for the trait
can be established from these clinical measures. Demographic and environmental factors
are further used to explain variation with respect to the trait. Further, genetic variations
associated with traits, such as disease-related traits, as well as the disease itself are used to
identify regions in the genome linked to a disease. For example, genetic variations in a
population may be used to determine what percentage of the variation of the trait in the
population of interest can be explained by genetic variation of a single nucleotide
polymorphism (SNP), haplotype, or short tandem repeat (STR) marker. However, as will
be described below, the elucidation of genes involved in biological pathways that
influence a trait, such as a disease, using either gene expression or genetic expression
approaches, is problematic and generally not successful in many instances.




2.1. USE OF MEASURED GENE EXPRESSION DATA TO IDENTIFY GENES 

AND PATHWAYS ASSOCIATED WITH TRAITS 

Within the past decade, several technologies have made it possible to monitor the
expression level of a large number of transcripts at any one time (see, e.g., Schena et al.,
1995, Quantitative monitoring of gene expression patterns with a complementary DNA
microarray, Science 270:467-470; Lockhart et al., 1996, Expression monitoring by
hybridization to high-density oligonucleotide arrays, Nature Biotechnology 14:1675-1680;
Blanchard et al., 1996, Sequence to array: Probing the genome's secrets, Nature
Biotechnology 14, 1649; U.S. Patent 5,569,588, issued October 29, 1996 to Ashby et al.
entitled "Methods for Drug Screening"). In organisms for which the complete genome is
known, it is possible to analyze the transcripts of all genes within the cell. With other
organisms, such as human, for which there is an increasing knowledge of the genome, it
is possible to simultaneously monitor large numbers of the genes within the cell.

Such monitoring technologies have been applied to the identification of genes that
are up regulated or down regulated in various diseased or physiological states, the
analyses of members of signaling cellular states, and the identification of targets for
various drugs. See, e.g., Friend and Hartwell, U.S. Patent Number 6,165,709; Stoughton,
U.S. Patent Number 6,132,969; Stoughton and Friend, U.S. Patent Number 5,965,352;
Friend and Stoughton, U.S. Patent Number 6,324,479; and Friend and Stoughton, U.S.
Patent Number 6,218,122, all incorporated herein by reference for all purposes.

Levels of various constituents of a cell are known to change in response to drug
treatments and other perturbations of the biological state of a cell. Measurements of a
plurality of such "cellular constituents" therefore contain a wealth of information about
the effect of perturbations and their effect on the biological state of a cell. Such
measurements typically comprise measurements of gene expression levels of the type
discussed above, but may also include levels of other cellular components such as, but by
no means limited to, levels of protein abundances, protein activity levels, or protein
interactions. The collection of such measurements is generally referred to as the "profile"
of the cell's biological state. Statistical and bioinformatical analysis of profile data has
been used to try to elucidate gene regulation events. Statistical and bioinformatical
techniques used in this analysis comprise hierarchical cluster analysis, reference or
supervised classification approaches and correlation-based analyses. See, e.g., Tamayo et
al., 1999, Interpreting patterns of gene expression with self-organizing maps: methods
and application to hematopoietic differentiation, Proc. Natl. Acad. Sci. U.S.A. 96:2907-2912;
Brown et al., 2000, Knowledge-based analysis of microarray gene expression data
by using support vector machines, Proc. Natl. Acad. Sci. U.S.A. 97:262-267;
Gaasterland and Bekiranov, Making the most of microarray data, Nat. Genet. 24:204-206;
Cohen et al., 2000, A computational analysis of whole-genome expression data
reveals chromosomal domains of gene expression, Nat. Genet. 24:5-6, 2000.

The use of gene expression data to identify genes and elucidate pathways
associated with traits has typically relied on the clustering of gene expression data over a
variety of conditions. See, e.g., Roberts et al., 2000, Signaling and circuitry of multiple
MAPK pathways revealed by a matrix of global gene expression profiles, Science
287:873; Hughes et al., 2000, Functional Discovery via a Compendium of Expression
Profiles, Cell 102:109. However, gene expression clustering has a number of drawbacks.
First, gene expression clustering has a tendency to produce false positives. Such false
positives arise, for example, when two genes coincidentally have correlated expression
profiles over a variety of conditions. Second, although gene expression clustering
provides information on the interaction between genes, it does not provide information on
the topology of biological pathways. For example, clustering of gene expression data
over a variety of conditions may be used to determine that genes A and B interact.
However, gene expression clustering typically does not provide sufficient information to
determine whether gene A is downstream or upstream from gene B in a biological
pathway. Third, direct biological experiments are often required to validate the
involvement of any gene identified from the clustering of gene expression data in order to
increase the confidence that the target is actually valid. For these reasons, the use of gene
expression data alone to identify genes involved in traits, such as various complex human
diseases, has often proven to be unsatisfactory.
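
The false-positive problem described above can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not a method taken from this application: the function names, the use of Pearson correlation as the similarity measure, and the choice of 200 genes, 8 conditions, and a 0.8 threshold are all hypothetical. Every simulated gene is independent noise, so any pair that passes the correlation threshold is coincidental.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def coincidental_pairs(n_genes, n_conditions, threshold, seed=0):
    """Count gene pairs whose |correlation| meets the threshold by chance.

    All profiles are independent Gaussian noise (an assumption made for
    illustration), so every counted pair is a false positive.
    """
    rng = random.Random(seed)
    profiles = [
        [rng.gauss(0.0, 1.0) for _ in range(n_conditions)]
        for _ in range(n_genes)
    ]
    hits = 0
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            if abs(pearson(profiles[i], profiles[j])) >= threshold:
                hits += 1
    total = n_genes * (n_genes - 1) // 2
    return hits, total


hits, total = coincidental_pairs(n_genes=200, n_conditions=8, threshold=0.8)
print(f"{hits} of {total} unrelated gene pairs look co-expressed")
```

With few conditions relative to the number of genes, some unrelated pairs will exceed almost any practical threshold, which is why correlation alone is a weak basis for inferring interaction.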


2.2. USE OF GENETICS DATA TO IDENTIFY GENES AND PATHWAYS

ASSOCIATED WITH TRAITS 

Genetics data have been used in the field of trait analysis in order to attempt to
identify the genes that affect such traits. A key development in such pursuits has been the
development of large collections of molecular/genetic markers, which can be used to
construct detailed genetic maps of species, such as humans. These maps are used in
Quantitative Trait Locus (QTL) mapping methodologies such as single-marker mapping,
interval mapping, composite interval mapping and multiple trait mapping. (For a review,
see Doerge, 2002, Mapping and analysis of quantitative trait loci in experimental
populations, Nature Reviews Genetics 3:43-62.) QTL mapping methodologies provide
statistical analysis of the association between phenotypes and genotypes for the purpose
of understanding and dissecting the regions of a genome that affect traits.
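
As an illustration of the simplest of these methodologies, single-marker mapping, the sketch below regresses a quantitative trait on the genotype at each marker and reports a LOD score per marker. It is a minimal example under stated assumptions rather than the procedure of this application: genotypes are assumed to be coded additively as 0, 1, or 2 copies of one allele, the model is ordinary least-squares regression, and the function names and toy data are hypothetical.

```python
import math


def single_marker_lod(genotypes, trait):
    """LOD score for a linear regression of trait on additive genotype code.

    LOD = (n / 2) * log10(RSS_null / RSS_marker), where RSS_null is the
    residual sum of squares of the mean-only model.
    """
    n = len(trait)
    mean_y = sum(trait) / n
    mean_g = sum(genotypes) / n
    sgg = sum((g - mean_g) ** 2 for g in genotypes)
    rss_null = sum((y - mean_y) ** 2 for y in trait)
    if sgg == 0 or rss_null == 0:
        return 0.0  # uninformative marker or constant trait
    sgy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, trait))
    rss_marker = rss_null - sgy ** 2 / sgg
    rss_marker = max(rss_marker, 1e-12 * rss_null)  # guard against a perfect fit
    return (n / 2) * math.log10(rss_null / rss_marker)


def scan_markers(marker_genotypes, trait):
    """Return {marker name: LOD} for every marker in the dictionary."""
    return {name: single_marker_lod(g, trait) for name, g in marker_genotypes.items()}


# Toy data: the trait tracks marker M1 closely and marker M2 only loosely.
markers = {
    "M1": [0, 0, 1, 1, 2, 2, 0, 1, 2, 1],
    "M2": [0, 1, 2, 0, 1, 2, 1, 0, 2, 1],
}
trait = [1.0, 1.2, 2.1, 1.9, 3.2, 2.8, 0.9, 2.0, 3.1, 2.2]
for name, lod in scan_markers(markers, trait).items():
    print(name, round(lod, 2))
```

Interval and composite interval mapping extend this idea to positions between markers and to models with background markers as cofactors; those refinements are not shown here.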

A quantitative trait locus (QTL) is a region of any genome that is responsible for
some percentage of the variation in the quantitative trait of interest. The goal of
identifying all such regions that are associated with a specific phenotype is typically
difficult to accomplish because of the sheer number of QTL, the possible epistasis or
interactions between QTL, as well as many additional sources of variation that can be
difficult to model and detect. To address these problems, QTL experiments can be
designed with the aim of containing the sources of variation to a limited number in order
to improve the chances of dissecting a phenotype. In general, a large sample of
individuals has to be collected to represent the total population, to provide an observable
number of recombinants and to allow a thorough assessment of the trait under
investigation. Using this information, coupled with one of several methodologies to
detect or locate QTL, associations between quantitative traits and genetic markers are
made as steps toward understanding the genetic basis of traits.

A drawback with QTL approaches is that, even when genomic regions that have
statistically significant associations with traits are identified, such regions are usually so
large that subsequent experiments, used to identify specific causative genes in these
regions, are time consuming and laborious. High density marker maps of the genomic
regions are required. Furthermore, physical resequencing of such regions is often
required. In fact, because of the size of the genomic regions identified, there is a danger
that causative genes within such regions simply will not be identified. In the event of
success, and the genomic region containing genes that are responsible for the trait
variation are elucidated, the expense and time from the beginning to the end of this
process is often too great for identifying genes and pathways associated with traits, such
as complex human diseases.

In the case of humans, the use of genetics to identify genes and pathways
associated with traits follows a very standard paradigm. First, a genome-wide linkage
study is performed using hundreds of genetic markers in family-based data to identify
broad regions linked to the trait. The result of this standard sort of linkage analysis is the
identification of regions controlling for the trait, thereby restricting attention from the
30,000 plus genes to perhaps as few as 500 to 1000 genes in a particular region of the
genome that is linked to the trait. However, the regions identified using linkage analysis
are still far too broad to identify candidate genes associated with the trait. Therefore,
such linkage studies are typically followed up by fine mapping the regions of linkage
using higher density markers in the linkage region, increasing the number of families in
the analysis, and identifying alternative populations for study. These efforts further
restrict attention to narrower regions of the genome, on the order of 100 genes in a
particular region linked to the trait. Even with the more narrowly defined linkage region,
the number of genes to validate is still unreasonably large. Therefore, research at this
stage focuses on identifying candidate genes based on putative function of known or
predicted genes in the region and the potential relevance of that function to the trait. This
approach is problematic because it is limited to what is currently known about genes.
Often, such knowledge is limited and subject to interpretation. As a result, researchers
are often led astray and do not identify the genes affecting the trait.

There are many reasons that standard genetic approaches have not proven very
successful in the identification of genes associated with complex traits, such as common
human diseases, or the biological pathways associated with such traits. First, common
human diseases such as heart disease, obesity, cancer, osteoporosis, schizophrenia, and
many others are complex in that they are polygenic. That is, they potentially involve
many genes across several different biological pathways and they involve complex
gene-environment interactions that obscure the genetic signature. Second, the complexity
of the diseases leads to a heterogeneity in the different biological pathways that can give
rise to the disease. Thus, in any given heterogeneous population, there may be defects
across several different pathways that can give rise to the disease. This reduces the ability
to identify the genetic signal for any given pathway. Because many populations involved
in genetic studies are heterogeneous with respect to the disease, multiple defects across
multiple pathways are operating within the population to give rise to the disease. Third,
as outlined above, the genomic regions associated with a linkage to a disease are large
and often contain a number of genes and possible variants that are potentially associated
with the disease. Fourth, the traits and disease states themselves are often not well
defined. Therefore, subphenotypes are often overlooked even though these
subphenotypes implicate different sets of biological pathways. This reduces the power of
detecting the associations. Fifth, even when genes and trait are highly correlated, the
genes may not give the same genetic signature. Sixth, in cases where genes and a trait are
moderately correlated, or not correlated at all, the genes may give rise to the same genetic
signature.

In addition to the heterogeneity problems discussed above, the identification of
genes and biological pathways associated with traits, such as complex human diseases,
using genetics data is confounded, when using human subjects, due to the inability to use
common genetic techniques and resources in humans. For example, humans cannot be
crossed in controlled experiments. Therefore, there is typically very little pedigree data
available for humans. Elucidation of genes associated with diseases in humans is also
difficult because humans are diploid organisms containing two genomes in each nucleate
cell, making it very hard to determine the DNA sequence of the haploid genome.
Because of these limitations, genetic approaches to discovering genes and biological
pathways associated with human diseases are unsatisfactory.

Companies such as deCode Genetics (Reykjavik, Iceland) study populations that
are isolated and so are more homogenous with respect to disease, thereby increasing the
power to detect association. The disease variations themselves in such populations are
greatly reduced as founder effects for many diseases are evident (i.e., specific forms of
diseases in such populations most likely arose from a single or small numbers of founders
of the population). Other companies, such as Sequenome (San Diego, California), use
twin cohorts to study complex diseases. Identical twins are a powerful tool in
establishing the genetic component of a trait. The genetic component of a trait is defined
as the degree to which a given trait is under genetic control. Dizygotic twins allow for
age, gender and environment matching, which helps reduce many of the confounding
factors that often reduce the power of genetic studies. In addition, the completion of the
human and mouse genomes has made the job of identifying candidate genes in a region of
linkage far easier, and it reduces dependency on considering only known genes, since
genomic regions can be annotated using ab initio gene prediction software to identify
novel candidate genes associated with the disease. Further, the use of demographic,
epidemiological and clinical data in more sophisticated models helps explain much of the
trait variation in a population. Reducing the overall variation in this way increases the
power to detect genetic variation. The identification of millions of SNPs allows finer
mapping in any given region of the genome and direct association testing of very large
case/control populations, thereby reducing the need to study families and more directly
identify the degree to which any genetic variant affects a given population. Finally, our
understanding of disease and the need to subphenotype a given disease is now more fully
appreciated and aids in reducing the heterogeneity of the disease under study.
Technologies such as microarrays have greatly facilitated the ability to subclassify
disease subtypes for a given disease. However, all of the methods still fall short when it
comes to efficiently identifying genes and pathways associated with diseases.

2.3. OBESITY
Obesity represents the most prevalent of body weight disorders, and it is the most
important nutritional disorder in the western world, with estimates of its prevalence
ranging from 30% to 50% within the middle-aged population. Other body weight
disorders, such as anorexia nervosa and bulimia nervosa, which together affect
approximately 0.2% of the female population of the western world, also pose serious
health threats. Further, such disorders as anorexia and cachexia (wasting) are also
prominent features of other diseases such as cancer, cystic fibrosis, and AIDS.

It has been estimated that half of all Americans are overweight. Within the United
States about 24% of men and 27% of women are defined as mildly to severely obese.
Individuals 20% over ideal weight guidelines are considered obese. Obesity is classified
as mild (20-40% overweight), moderate (41-100% overweight), and severe (>100%
overweight). Severe obesity is relatively rare, affecting less than 0.5% of all obese
individuals and about 0.1% of the total population.

In order to measure obesity, the weight/height ratio may be calculated by
obtaining the weight of an individual in kilograms (kg) and dividing this value by the
square of the height of the individual in meters. Alternatively, the weight/height ratio of
an individual may be obtained by multiplying the weight of the individual in pounds (lbs)
by 703 and dividing this value by the square of the height of the individual (in inches
(in)). These ratios are typically referred to as BMI. Thus, BMI = kg/m² or BMI = (lbs x
703)/(in)². Where BMI is utilized as a measure of obesity, an individual is considered
overweight when BMI values range between 25.0 and 29.9. Obesity is defined as BMI
values greater than or equal to 30.0. The World Health Organization assigns BMI values
as follows: 25.0-29.9, Grade I obesity (moderately overweight); 30-39.9, Grade II obesity
(severely overweight); and 40.0 or greater, Grade III obesity (massive/morbid obesity).
Using weight tables, obesity is classified as mild (20-40% overweight), moderate (41-
100% overweight), and severe (>100% overweight). Individuals 20% over ideal weight
guidelines are considered obese. Individuals 1-19.9% over ideal weight are classified as
overweight.
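
The two BMI formulas and the World Health Organization grades given above can be written out as a short sketch. The function names are hypothetical, and the grade boundaries are treated as half-open intervals (for example, a BMI of exactly 30.0 falls in Grade II), which is an assumption made to close the gap between the stated ranges 25.0-29.9 and 30-39.9.

```python
def bmi_metric(weight_kg, height_m):
    """BMI from weight in kilograms and height in meters: kg / m^2."""
    return weight_kg / height_m ** 2


def bmi_imperial(weight_lb, height_in):
    """BMI from weight in pounds and height in inches: (lbs x 703) / in^2."""
    return 703.0 * weight_lb / height_in ** 2


def who_grade(bmi):
    """Map a BMI value to the grade labels listed in the text above."""
    if bmi >= 40.0:
        return "Grade III obesity"
    if bmi >= 30.0:
        return "Grade II obesity"
    if bmi >= 25.0:
        return "Grade I obesity"
    return "below Grade I"


# The same individual expressed in both unit systems (70 kg, 1.75 m).
print(round(bmi_metric(70.0, 1.75), 1), who_grade(bmi_metric(70.0, 1.75)))
print(round(bmi_imperial(154.32, 68.9), 1), who_grade(bmi_imperial(154.32, 68.9)))
```

The factor 703 is the unit conversion that makes the pounds-and-inches formula agree with the kilograms-and-meters formula, so both calls describe the same ratio.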

Obesity also contributes to other diseases. For example, this disorder is
responsible for increased incidence of diseases such as coronary artery disease,
hypertension, stroke, diabetes, hyperlipidemia, and some cancers (see, e.g., Nishina, P.
M. et al., 1994, Metab. 43: 554-558; Grundy, S. M. & Barnett, J. P., 1990, Dis. Mon. 36:
641-731). Obesity is not merely a behavioral problem, i.e., the result of voluntary
hyperphagia. Rather, the differential body composition observed between obese and
normal subjects results from differences in both metabolism and neurologic/metabolic
interactions. These differences seem to be, to some extent, due to differences in gene
expression, and/or level of gene products or activity (Friedman, J. M. et al., 1991,
Mammalian Gene 1: 130-144).

The epidemiology of obesity strongly shows that the disorder exhibits inherited
characteristics (Stunkard, 1990, N. Eng. J. Med. 322: 1438). Moll et al. have reported
that, in many populations, obesity seems to be controlled by a few genetic loci (Moll et
al., 1991, Am. J. Hum. Gen. 49: 1243). In addition, human twin studies strongly suggest
a substantial genetic basis in the control of body weight, with estimates of heritability of
80-90% (Simopoulos, A. P. & Childs, B., eds., 1989, in "Genetic Variation and Nutrition
in Obesity", World Review of Nutrition and Diabetes 63, S. Karger, Basel, Switzerland;
Borjeson, M., 1976, Acta. Paediatr. Scand. 65: 279-287).

In other studies, non-obese persons who deliberately attempted to gain weight by
systematically over-eating were found to be more resistant to such weight gain and able to
maintain an elevated weight only by very high caloric intake. In contrast, spontaneously
obese individuals are able to maintain their status with normal or only moderately
elevated caloric intake. In addition, it is a commonplace experience in animal husbandry
that different strains of swine, cattle, etc., have different predispositions to obesity.
Studies of the genetics of human obesity, and of animal models of obesity demonstrate
that obesity results from complex defective regulation of both food intake, food induced
energy expenditure, and of the balance between lipid and lean body anabolism.

There are a number of genetic diseases in man and other species that feature
obesity among their more prominent symptoms, along with, frequently, dysmorphic
features and mental retardation. For example, Prader-Willi syndrome (PWS; reviewed in
Knoll, J. H. et al., 1993, Am. J. Med. Genet. 46: 2-6) affects approximately 1 in 20,000
live births, and involves poor neonatal muscle tone, facial and genital deformities, and
generally obesity.

In addition to PWS, many other pleiotropic syndromes have been characterized
that include obesity as a symptom. These syndromes are genetically straightforward, and
appear to involve autosomal recessive alleles. Such diseases include, among others,
Ahlstroem, Carpenter, Bardet-Biedl, Cohen, and Morgagni-Stewart-Monel Syndromes.

A number of models exist for the study of obesity (see, e.g., Bray, G. A., 1992,
Prog. Brain Res. 93: 333-341; and Bray, G. A., 1989, Amer. J. Clin. Nutr. 5: 891-902).
For example, animals having mutations that lead to syndromes that include obesity
symptoms have also been identified. Attempts have been made to utilize such animals as
models for the study of obesity, and the best studied animal models to date for genetic
obesity are mice. For reviews, see, e.g., Friedman, J. M. et al., 1991, Mamm. Gen. 1: 130-
144; Friedman, J. M. and Liebel, R. L., 1992, Cell 69: 217-220.

Studies utilizing mice have confirmed that obesity is a very complex trait with a
high degree of heritability. Mutations at a number of loci have been identified that lead to
obese phenotypes. These include the autosomal recessive mutations obese (ob), diabetes
(db), fat (fat), and tubby (tub).

Thus, given the above background, what is needed in the art are improved
methods for identifying genes and biological pathways that affect complex traits such as
diseases. In particular, genes and biological pathways that affect obesity, which poses a
major, worldwide health problem, are needed.

Discussion or citation of a reference herein will not be construed as an admission 
that such reference is prior art to the present invention. 

3. SUMMARY OF THE INVENTION

The present invention provides an improvement over the art by uniquely
combining gene expression approaches with genetic approaches in order to determine the
genes associated with traits, such as complex human diseases. In the computer systems
and methods of the present invention, genetic approaches are used to filter out false
positive genes from gene expression clusters. Furthermore, the computer systems and
methods of the present invention are used to advantageously combine gene expression
data with genetics data to elucidate biological pathways associated with traits.


3.1. ASSOCIATING A TRAIT WITH A GENE IN A FIRST SPECIES BY USING 

eQTL-cQTL OVERLAP IN A SECOND SPECIES 

One aspect of the invention provides a method for associating a gene G in the
genome of a first species with a clinical trait T exhibited by the first species and a second
species. In the method a gene G' is found in the second species that is an ortholog of the
gene G. Further, an expression quantitative trait locus (eQTL) is identified for gene G'
using a first quantitative trait locus (QTL) analysis. The first QTL analysis uses a plurality
of expression statistics for gene G' as a quantitative trait. Each expression statistic in the
plurality of expression statistics represents an expression value for the gene G' in an
organism in a plurality of organisms of the second species. A clinical quantitative trait
locus (cQTL) that is linked to the clinical trait T is identified using a second QTL analysis.
The second QTL analysis uses a plurality of phenotypic values as a quantitative trait.
Each phenotypic value in the plurality of phenotypic values represents a phenotypic value
for the clinical trait T in an organism in the plurality of organisms of the second species.
Next, a determination is made as to whether the eQTL and the cQTL colocalize to the
same locus in the genome of the second species. When the eQTL and the cQTL
colocalize to the same locus, the gene G is associated with the clinical trait T in the first
species.
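
The colocalization decision at the end of this method can be sketched as follows. This is an illustrative outline under stated assumptions, not the claimed implementation: each QTL analysis is assumed to have already produced a list of (chromosome, position in cM, LOD score) rows, the strongest row above a threshold is taken as the QTL, and the LOD threshold of 3.0 and the 5 cM colocalization window are hypothetical defaults chosen only for the example. All function names are invented here.

```python
from typing import Optional, Sequence, Tuple

ScanRow = Tuple[str, float, float]  # (chromosome, position in cM, LOD score)
Locus = Tuple[str, float]           # (chromosome, position in cM)


def peak_locus(scan: Sequence[ScanRow], lod_threshold: float) -> Optional[Locus]:
    """Return the highest-scoring locus, or None if nothing reaches the threshold."""
    best = max(scan, key=lambda row: row[2], default=None)
    if best is None or best[2] < lod_threshold:
        return None
    return (best[0], best[1])


def colocalize(a: Optional[Locus], b: Optional[Locus], window_cm: float) -> bool:
    """True when both loci exist, share a chromosome, and lie within the window."""
    if a is None or b is None:
        return False
    return a[0] == b[0] and abs(a[1] - b[1]) <= window_cm


def gene_associated_with_trait(
    expression_scan: Sequence[ScanRow],
    clinical_scan: Sequence[ScanRow],
    lod_threshold: float = 3.0,  # hypothetical significance cutoff
    window_cm: float = 5.0,      # hypothetical colocalization window
) -> bool:
    """Associate gene G with trait T when the eQTL for G' and the cQTL for T overlap."""
    eqtl = peak_locus(expression_scan, lod_threshold)
    cqtl = peak_locus(clinical_scan, lod_threshold)
    return colocalize(eqtl, cqtl, window_cm)


# Toy scans in the second species: expression of G' and clinical trait T.
expression_scan = [("chr2", 10.0, 1.2), ("chr2", 42.0, 6.5), ("chr7", 15.0, 2.0)]
clinical_scan = [("chr2", 44.5, 4.1), ("chr7", 15.0, 1.1)]
print(gene_associated_with_trait(expression_scan, clinical_scan))
```

Taking only the single strongest peak per trait keeps the sketch short; a fuller treatment would consider every significant peak and test whether the overlap reflects a shared cause.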

In some embodiments, the above described determining step further comprises
determining whether the locus of the eQTL in the genome of the second species
corresponds to the physical location of the gene G' in the genome of the second species.
When the locus of the eQTL in the genome of the second species corresponds to the
physical location of the gene G' in the genome of the second species, the gene G is
associated with the clinical trait T.

In some embodiments, the eQTL corresponds to the physical location of the gene
G' when the eQTL and the gene G' colocalize within about 3 cM or within about 1 cM of
each other in the genome of the second species. In some embodiments, the method
further comprises testing whether a colocalization of the eQTL and the cQTL is caused
by pleiotropy. In some embodiments, the first QTL analysis and the second QTL analysis
each uses a genetic marker map that represents the genome of the second species.
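
The distance criterion in this paragraph amounts to a simple window test, sketched below. The function name is hypothetical, and the sketch assumes that the physical location of gene G' has already been converted to a position on the same genetic map (in cM) as the eQTL; how that conversion is done is not specified here.

```python
def eqtl_matches_gene_location(eqtl_chrom, eqtl_cm, gene_chrom, gene_cm, window_cm=3.0):
    """True when the eQTL lies on the gene's chromosome within window_cm of it.

    window_cm would be about 3.0 or about 1.0 for the embodiments described
    above; gene_cm is assumed to be expressed on the same genetic map.
    """
    return eqtl_chrom == gene_chrom and abs(eqtl_cm - gene_cm) <= window_cm


# An eQTL at 42.0 cM and a gene at 43.5 cM on the same chromosome.
print(eqtl_matches_gene_location("chr2", 42.0, "chr2", 43.5))
print(eqtl_matches_gene_location("chr2", 42.0, "chr2", 43.5, window_cm=1.0))
```

The same pair passes the 3 cM criterion and fails the stricter 1 cM criterion, which shows how the choice of window changes which eQTL are treated as lying at the gene itself.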

Some embodiments of the present invention include an additional step that is
performed prior to the first identifying step. This additional step comprises constructing
the genetic marker map from a set of genetic markers associated with a plurality of
organisms representing the second species. In some embodiments, the set of genetic
markers comprises single nucleotide polymorphisms (SNPs), microsatellite markers,
restriction fragment length polymorphisms, short tandem repeats, DNA methylation
markers, sequence length polymorphisms, random amplified polymorphic DNA, amplified
fragment length polymorphisms, or simple sequence repeats. In some embodiments, the
genotype data is used in the constructing step and the genotype data comprises knowledge
of which alleles, for each marker in the set of genetic markers, are present in each
organism in the plurality of organisms representing the second species. In some
embodiments, the plurality of organisms representing the second species represents a
segregating population and pedigree data is used in the constructing step. Further, this
pedigree data shows one or more relationships between organisms in the plurality of
organisms representing the second species. In some embodiments, the plurality of
organisms representing the second species comprises an F2 population, a Ft population, a
F2:3 population, or a Design III population and the one or more relationships between
organisms in the plurality of organisms representing the second species indicates which
organisms in the plurality of organisms representing the second species are members of
the F2 population, the Ft population, the F2:3 population, or the Design III population.

In some embodiments, each expression value is a normalized expression level
measurement for the gene G' in an organism in the plurality of organisms of the second
species. In some embodiments, each such expression level measurement is determined by
measuring an amount of a cellular constituent encoded by the gene G' in one or more
cells from an organism in the plurality of organisms of the second species. In some
embodiments, the amount of the cellular constituent comprises an abundance of an RNA
present in the one or more cells of the organism. In some embodiments, the abundance of
the RNA is measured by a method comprising contacting a gene transcript array with the
RNA from the one or more cells of the organism, or with nucleic acid derived from the
RNA. In such embodiments, the gene transcript array comprises a positionally
addressable surface with attached nucleic acids or nucleic acid mimics. These nucleic
acids or nucleic acid mimics are capable of hybridizing with the RNA species, or with
nucleic acid derived from the RNA species. The normalized expression level
measurement is obtained by a normalization technique such as, for example, Z-score of
intensity, median intensity, log median intensity, Z-score standard deviation log of
intensity, Z-score mean absolute deviation of log intensity, calibration DNA gene set,
user normalization gene set, ratio median intensity correction, or intensity background
correction.
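
As a rough illustration of one of the normalization techniques listed above, the following Python sketch computes a Z-score of log intensity for a set of raw intensities. The function name `z_score_log_intensity`, the base-10 logarithm, and the use of the sample standard deviation are assumptions made for this example; the text does not specify them.

```python
import math
import statistics


def z_score_log_intensity(intensities):
    """Return Z-scores of log10 intensities (illustrative assumption only).

    Each raw intensity is log-transformed, then centered on the mean and
    scaled by the sample standard deviation of the log values.
    """
    if len(intensities) < 2:
        raise ValueError("at least two intensities are required")
    if any(value <= 0 for value in intensities):
        raise ValueError("intensities must be positive to take a logarithm")
    logs = [math.log10(value) for value in intensities]
    mean = statistics.mean(logs)
    spread = statistics.stdev(logs)
    if spread == 0:
        raise ValueError("all intensities are identical; Z-score is undefined")
    return [(value - mean) / spread for value in logs]
```

Other listed techniques, such as median intensity normalization, would replace the centering and scaling step with a different statistic.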

In some embodiments, the first QTL analysis comprises (i) testing for linkage between
(a) the genotype of the plurality of organisms of the second species at a position in the
genome of the second species and (b) the plurality of expression statistics for gene G'; (ii)
advancing the position in the genome by an amount; and (iii) repeating steps (i) and (ii)
until all or a portion of the genome of the second species has been tested. In some
embodiments, the amount is less than 100 centiMorgans, less than 10 centiMorgans, less
than 5 centiMorgans, or less than 2.5 centiMorgans. In some embodiments, the testing of
(i) comprises performing linkage analysis or association analysis. In some embodiments,
linkage analysis or association analysis generates a statistical score for the position in the
genome of the second species. In some embodiments, the testing is linkage analysis and the
statistical score is a logarithm of the odds (lod) score. In some embodiments, the eQTL is
represented by a lod score that is greater than about 2.0, that is greater than about 3.0, that
is greater than about 4.0, or that is greater than about 5.0.
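
The scan described above (test a position, advance, repeat) might be sketched as follows in Python. This is a simplified illustration under stated assumptions, not the claimed method: each position is scored with a single-marker linear regression lod score, (n/2) log10(RSS0/RSS1), and a hypothetical caller-supplied function `genotype_at` is assumed to return one numeric genotype code per organism at a given map position. The function names, the 2 cM step, and the lod threshold of 3.0 are illustrative choices taken from the ranges mentioned in the text.

```python
import math


def lod_score(genotypes, trait):
    """Regression-based lod score for one position (illustrative only).

    Compares the residual sum of squares of a mean-only model (RSS0) with
    that of a linear regression of the trait on the genotype code (RSS1).
    """
    n = len(trait)
    mean_g = sum(genotypes) / n
    mean_y = sum(trait) / n
    rss0 = sum((y - mean_y) ** 2 for y in trait)
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if rss0 == 0.0 or sxx == 0.0:
        return 0.0  # no trait or genotype variation: no evidence of linkage
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, trait))
    rss1 = rss0 - sxy * sxy / sxx
    if rss1 <= 0.0:
        return math.inf  # genotype explains the trait perfectly
    return (n / 2.0) * math.log10(rss0 / rss1)


def scan_genome(genotype_at, trait, start_cm, end_cm, step_cm=2.0, threshold=3.0):
    """Step through a map interval and collect positions whose lod exceeds threshold.

    genotype_at is a hypothetical callable: position in cM -> genotype codes.
    """
    n_steps = int((end_cm - start_cm) / step_cm)
    hits = []
    for i in range(n_steps + 1):
        position = start_cm + i * step_cm          # (ii) advance the position
        lod = lod_score(genotype_at(position), trait)  # (i) test for linkage
        if lod > threshold:
            hits.append((position, lod))
    return hits
```

The same loop serves for an eQTL scan (trait values are expression statistics for G') or a cQTL scan (trait values are phenotypic values for T); only the trait vector changes.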

In some embodiments, the second QTL analysis comprises (i) testing for linkage
between (a) the genotype of the plurality of organisms of the second species at a position
in the genome of the second species and (b) the plurality of phenotypic values; (ii)
advancing the position in the genome by an amount; and (iii) repeating steps (i) and (ii)
until all or a portion of the genome of the second species has been tested. The amount
advanced can be, for example, less than 100 centiMorgans, less than 10 centiMorgans,
less than 5 centiMorgans, or less than 2.5 centiMorgans. The testing of (i) in the second
QTL analysis may comprise performing linkage analysis or association analysis. This
linkage analysis or association analysis generates a statistical score for the position in the
genome of the second species. In some embodiments, the testing is linkage analysis and
the statistical score is a logarithm of the odds (lod) score. In some embodiments, the
cQTL is represented by a lod score that is greater than about 2.0, by a lod score that is
greater than about 3.0, by a lod score that is greater than about 4.0, or by a lod score that
is greater than about 5.0.


In some embodiments, the first species is human. In some embodiments, the
second species is a plant or an animal. In some embodiments, the second species is corn,
beans, rice, tobacco, potatoes, tomatoes, cucumbers, apple trees, orange trees, cabbage,
lettuce, or wheat. In some embodiments, the second species is a mammal, a primate,
mice, rats, dogs, cats, chickens, horses, cows, pigs, or monkeys. In still other
embodiments, the second species is Drosophila, yeast, a virus, or Caenorhabditis elegans.

In some embodiments, the clinical trait T is a complex trait. In some
embodiments, the complex trait T is characterized by an allele that exhibits incomplete
penetrance in the second species. In some embodiments, the complex trait is a disease
that is contracted by an organism in the plurality of organisms of the second species.
Further, the organism inherits no predisposing allele to the disease. In some
embodiments, the complex trait arises when any of a plurality of different genes in the
genome of the second species is somatically mutated. In some embodiments, the
complex trait requires the simultaneous presence of mutations in a plurality of genes in
the genome of the second species. In still other embodiments, the complex trait is
associated with a high frequency of disease-causing alleles in the second species. In yet
other embodiments, the complex trait is a phenotype that does not exhibit Mendelian
recessive or dominant inheritance attributable to a gene locus. In some embodiments, the
complex trait is susceptibility to heart disease, hypertension, diabetes, cancer, infection,
polycystic kidney disease, early-onset Alzheimer's disease, maturity-onset diabetes of the
young, hereditary nonpolyposis colon cancer, ataxia telangiectasia, obesity, nonalcoholic
steatohepatitis (NASH), nonalcoholic fatty liver (NAFL), or xeroderma pigmentosum.

In some embodiments, the eQTL and the cQTL colocalize to the same locus in the
genome of the second species when the physical location of the eQTL in the genome is
within about 40 cM of the physical location of the cQTL in the genome, within about 20
cM of the physical location of the cQTL in the genome, within about 10 cM of the
physical location of the cQTL in the genome, or within about 6 cM of the physical
location of the cQTL in the genome.
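
The colocalization test above reduces to a distance comparison on the genetic map. The Python sketch below shows one way this could look; the `Locus` type, the function name `within_window`, and the requirement that both loci lie on the same chromosome are assumptions introduced for illustration, and the window sizes are examples drawn from the values recited in the text.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    """A map position: chromosome label plus position in centiMorgans."""
    chromosome: str
    position_cm: float


def within_window(a: Locus, b: Locus, window_cm: float) -> bool:
    """True when two loci share a chromosome and lie within window_cm of each other."""
    return a.chromosome == b.chromosome and abs(a.position_cm - b.position_cm) <= window_cm


# Hypothetical positions, not data from this document.
eqtl = Locus("2", 41.0)
cqtl = Locus("2", 47.5)
print(within_window(eqtl, cqtl, 10.0))  # colocalized under a 10 cM window
print(within_window(eqtl, cqtl, 6.0))   # not colocalized under a 6 cM window
```

The same helper could be reused, with a window of about 1 cM or 3 cM, to check whether an eQTL corresponds to the physical location of the gene G'.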

Another aspect of the present invention provides a computer program product for
use in conjunction with a computer system. The computer program product comprises a
computer readable storage medium and a computer program mechanism embedded
therein. The computer program mechanism associates a gene G in the genome of a first
species with a clinical trait T exhibited by the first species and a second species. The
computer program mechanism comprises an ortholog identification module for finding a
gene G' in the second species that is an ortholog of the gene G. The computer program
mechanism further comprises an expression quantitative trait loci (eQTL) identification
module for identifying an expression quantitative trait loci (eQTL) for the gene G' using a
first quantitative trait loci (QTL) analysis. The first QTL analysis uses a plurality of
expression statistics for the gene G' as a quantitative trait and each expression statistic in
the plurality of expression statistics represents an expression value for the gene G' in an
organism in the plurality of organisms of the second species. The computer program
mechanism further comprises a clinical quantitative trait loci (cQTL) identification
module for identifying a clinical quantitative trait loci (cQTL) that is linked to the clinical
trait T using a second QTL analysis. The second QTL analysis uses a plurality of
phenotypic values as a quantitative trait. Each phenotypic value in the plurality of
phenotypic values represents a phenotypic value for the clinical trait T in an organism in
the plurality of organisms of the second species. The computer program mechanism also
comprises a determination module for determining whether the eQTL and the cQTL
colocalize to the same locus in the genome of the second species, wherein, when the
eQTL and the cQTL colocalize to the same locus, the gene G is associated with the
clinical trait T in the first species.

Another aspect of the present invention provides a computer system for
associating a gene G in the genome of a first species with a clinical trait T exhibited by
the first species and a second species. The computer system comprises a central
processing unit and a memory coupled to the central processing unit. The memory stores
an ortholog identification module, an expression quantitative trait loci (eQTL)
identification module, a clinical quantitative trait loci (cQTL) identification module, and a
determination module. The ortholog identification module comprises instructions for
finding a gene G' in the second species that is an ortholog of the gene G. The expression
quantitative trait loci (eQTL) identification module comprises instructions for identifying
an expression quantitative trait loci (eQTL) for the gene G' using a first quantitative trait
loci (QTL) analysis. The first QTL analysis uses a plurality of expression statistics for
gene G' as a quantitative trait. Each expression statistic in the plurality of expression
statistics represents an expression value for the gene G' in an organism in a plurality of
organisms of the second species. The clinical quantitative trait loci (cQTL) identification
module comprises instructions for identifying a clinical quantitative trait loci (cQTL) that
is linked to the clinical trait T using a second QTL analysis. The second QTL analysis
uses a plurality of phenotypic values as a quantitative trait. Each phenotypic value in the
plurality of phenotypic values represents a phenotypic value for the clinical trait T in an
organism in the plurality of organisms of the second species. The determination module
comprises instructions for determining whether the eQTL and the cQTL colocalize to the
same locus in the genome of the second species. When the eQTL and the cQTL
colocalize to the same locus, the gene G is associated with the clinical trait T.


3.2. ASSOCIATING A QTL WITH A COMPLEX TRAIT IN A FIRST SPECIES
BY CLUSTERING QTL DATA FROM A SECOND SPECIES

Another aspect of the present invention provides a method for associating a gene
G in the genome of a first species with a clinical trait T exhibited by the first species and
a second species. In the method, quantitative trait locus data from a plurality of
quantitative trait locus analyses is clustered to form a quantitative trait locus interaction
map. Each quantitative trait locus analysis in the plurality of quantitative trait locus
analyses is performed for a gene in a plurality of genes in the genome of the second
species using a genetic marker map and a quantitative trait in order to produce the
quantitative trait locus data. For each quantitative trait locus analysis, the quantitative
trait comprises an expression statistic for the gene for which the quantitative trait locus
analysis is performed. The genetic marker map is constructed from a set of genetic
markers associated with the plurality of organisms of the second species. The
quantitative trait locus interaction map is analyzed to identify a gene G' associated with a
trait. Then, the gene G in the first species that is the ortholog of the gene G' of the
second species is identified, thereby associating a gene G in the genome of the first
species with a clinical trait T exhibited by the first species. In some embodiments, the
method further comprises an additional step that is performed prior to the clustering step.
This additional step comprises performing each of the quantitative trait locus analyses in
the plurality of quantitative trait locus analyses.

In some embodiments, the expression statistic for gene G' is computed by a
method comprising transforming an expression level measurement of gene G' from each
organism in the plurality of organisms of the second species. In some embodiments, each
quantitative trait locus analysis comprises: (i) testing for linkage between a position in a
chromosome, in the genome of the second species, and the quantitative trait used in the
quantitative trait locus analysis; (ii) advancing the position in the chromosome by an
amount; and (iii) repeating steps (i) and (ii) until all or a portion of the genome has been
tested. In some embodiments, the quantitative trait locus data produced from each
respective quantitative trait locus analysis comprises a logarithm of the odds score
computed at each position tested. In some embodiments, the testing comprises
performing linkage analysis or association analysis.




In some embodiments, the clustering of the quantitative trait locus data from each
quantitative trait locus analysis comprises applying a hierarchical clustering technique,
applying a k-means technique, applying a fuzzy k-means technique, applying a
Jarvis-Patrick clustering, applying a self-organizing map technique, or applying a neural
network technique.

Some embodiments of the method include an additional step in which a gene
expression cluster map is constructed from each expression statistic created by the
transforming step. In some embodiments, construction of the gene expression cluster
map comprises creating a plurality of gene expression vectors, each gene expression
vector in the plurality of gene expression vectors representing an expression level
measurement of a gene, in the plurality of genes in the genome of the second species, in
each of the plurality of organisms of the second species. Then, a plurality of correlation
coefficients are computed. Each correlation coefficient in the plurality of correlation
coefficients is computed between a gene expression vector pair in the plurality of gene
expression vectors. The plurality of gene expression vectors are clustered based on the
plurality of correlation coefficients in order to form the gene expression cluster map. In
some embodiments, the step of analyzing the quantitative trait locus interaction map
comprises filtering the quantitative trait locus interaction map in order to obtain a
candidate pathway group. The filtering comprises identifying a quantitative trait locus in
the candidate pathway group in the gene expression cluster map. In some embodiments,
construction of the gene expression cluster map comprises (i) creating a plurality of gene
expression vectors, each gene expression vector in the plurality of gene expression
vectors representing a gene in the plurality of genes; (ii) computing a plurality of metrics,
wherein each metric in the plurality of metrics is computed between a gene expression
vector pair in the plurality of gene expression vectors; and (iii) clustering the plurality of
gene expression vectors based on the plurality of metrics in order to form the gene
expression cluster map. In some embodiments, the plurality of genes comprises at least
five genes.
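
The construction of a gene expression cluster map from pairwise correlation coefficients can be illustrated with a short Python sketch. Several choices here are assumptions rather than anything the text prescribes: the metric is the Pearson correlation coefficient, and the clustering step is a deliberately simple stand-in that groups genes whose correlation meets a cutoff `min_r` (equivalent to cutting a single-linkage tree at that cutoff). The gene names and expression values in the usage lines are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant vector carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def pairwise_correlations(vectors):
    """Map each gene pair to the correlation of their expression vectors."""
    return {
        (g1, g2): pearson(vectors[g1], vectors[g2])
        for g1, g2 in combinations(sorted(vectors), 2)
    }


def cluster_by_correlation(vectors, min_r=0.9):
    """Group genes connected by correlations >= min_r (union-find)."""
    parent = {gene: gene for gene in vectors}

    def find(gene):
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path compression
            gene = parent[gene]
        return gene

    for (g1, g2), r in pairwise_correlations(vectors).items():
        if r >= min_r:
            parent[find(g1)] = find(g2)

    clusters = {}
    for gene in vectors:
        clusters.setdefault(find(gene), []).append(gene)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical expression vectors: one value per organism.
vectors = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.1],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
print(cluster_by_correlation(vectors))
```

A production analysis would more likely use an established hierarchical or k-means implementation; the point of the sketch is only the flow from vectors to metrics to clusters.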

Another embodiment of the present invention provides a computer program
product for use in conjunction with a computer system. The computer program product
comprises a computer readable storage medium and a computer program mechanism
embedded therein. The computer program mechanism comprises a clustering module, an
analysis module, and an ortholog identification module. The clustering module is used
for clustering quantitative trait locus data from a plurality of quantitative trait locus
analyses to form a quantitative trait locus interaction map. Each quantitative trait locus
analysis in the plurality of quantitative trait locus analyses is performed for a gene in a
plurality of genes in the genome of a second species using a genetic marker map and a
quantitative trait in order to produce the quantitative trait locus data. For each
quantitative trait locus analysis, the quantitative trait comprises an expression statistic for
the gene for which the quantitative trait locus analysis is performed, for each organism in
a plurality of organisms of the second species. Further, the genetic marker map is
constructed from a set of genetic markers associated with the plurality of organisms of the
second species. The analysis module is for analyzing the quantitative trait locus
interaction map to identify a gene G' associated with a trait exhibited by a first species
and the second species. The ortholog identification module is for finding a gene G in the
first species that is an ortholog of the gene G' in the second species.

Some embodiments of the present invention provide a computer system for
associating a gene G in the genome of a first species with a clinical trait T exhibited by
the first species and a second species. The computer system comprises a central
processing unit and a memory coupled to the central processing unit. The memory stores
a clustering module, an analysis module and an ortholog identification module. The
clustering module is for clustering quantitative trait locus data from a plurality of
quantitative trait locus analyses to form a quantitative trait locus interaction map. Each
quantitative trait locus analysis in the plurality of quantitative trait locus analyses is
performed for a gene in a plurality of genes in the genome of the second species using a
genetic marker map and a quantitative trait in order to produce the quantitative trait locus
data. For each quantitative trait locus analysis, the quantitative trait comprises an
expression statistic for the gene for which the quantitative trait locus analysis is
performed, for each organism in a plurality of organisms of the second species. The
genetic marker map is constructed from a set of genetic markers associated with the
plurality of organisms of the second species. The analysis module is for analyzing the
quantitative trait locus interaction map to identify a gene G' associated with a trait
exhibited by the first species and the second species. The ortholog identification module
is for finding a gene G in the first species that is an ortholog of the gene G' of the second
species.




3.3. ASSOCIATING A COMPLEX TRAIT WITH A GENE IN A FIRST SPECIES
BY SUBDIVIDING A SECOND SPECIES POPULATION

Another aspect of the present invention provides a method for identifying a
quantitative trait locus for a complex trait in a first species. The complex trait is exhibited
by the first species and a second species. In the method, a plurality of organisms of the
second species are divided into a plurality of subpopulations using a classification scheme
that classifies each organism in the plurality of organisms of the second species into at
least one of the subpopulations. The classification scheme uses a plurality of cellular
constituent measurements from each organism of the second species. Further, for at least
one subpopulation in the plurality of subpopulations, the method provides the step of
performing quantitative genetic analysis on the subpopulation in order to identify a
quantitative trait locus for the complex trait in the second species. The method further
provides the step of finding the quantitative trait locus in the first species that is the
ortholog of the quantitative trait locus of the second species, thereby identifying the
quantitative trait locus for the complex trait in the first species.

In some embodiments, the complex trait is a disease that is contracted by an
organism in the first species or the second species where the organism inherits no
predisposing allele to the disease. In some embodiments, the complex trait arises when
any of a plurality of different genes in the genome of the first species or the second
species is mutated. In some embodiments, the complex trait is associated with a high
frequency of disease-causing alleles in the first species or the second species.

In some embodiments, the complex trait is a phenotype that does not exhibit
Mendelian recessive or dominant inheritance attributable to a gene locus. In some
embodiments, the complex trait is susceptibility to heart disease, hypertension, diabetes,
cancer, infection, polycystic kidney disease, early-onset Alzheimer's disease,
maturity-onset diabetes of the young, hereditary nonpolyposis colon cancer, ataxia
telangiectasia, obesity, or xeroderma pigmentosum.

In some embodiments, the plurality of cellular constituent measurements from
each organism of the second species comprises the measurement of the cellular
constituent levels of ten or more cellular constituents in each organism. In some
embodiments, the dividing step comprises determining whether a class predictor is
available, and when a class predictor is available, using a supervised classification
scheme to classify each organism in the plurality of organisms of the second species into
a subpopulation in the plurality of subpopulations. When a class predictor is not
available, an unsupervised classification scheme is used to classify each organism in the
plurality of organisms of the second species into a subpopulation in the plurality of
subpopulations.

In some embodiments, the classification scheme is a supervised classification
scheme. In some embodiments, the classification scheme is an unsupervised
classification scheme. In some embodiments, the unsupervised classification scheme is a
hierarchical cluster analysis that uses a nearest-neighbor algorithm, a farthest-neighbor
algorithm, an average linkage algorithm, a centroid algorithm, or a sum-of-squares
algorithm to determine the similarity between (i) the plurality of cellular constituent
measurements from one organism in the plurality of organisms of the second species and
(ii) the plurality of cellular constituent measurements from another organism in the
plurality of organisms of the second species.
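
As a hedged illustration of the unsupervised scheme described above, the following Python sketch performs agglomerative hierarchical clustering of organisms using nearest-neighbor, farthest-neighbor, or average linkage. The Euclidean distance, the function names, the stopping rule (a fixed number of subpopulations, `n_clusters`), and the organism labels in the usage lines are assumptions for this example; the centroid and sum-of-squares algorithms named in the text are not shown.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two measurement profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# How distances between members of two clusters are combined.
LINKAGE = {
    "nearest": min,                                 # single linkage
    "farthest": max,                                # complete linkage
    "average": lambda ds: sum(ds) / len(ds),        # average linkage
}


def agglomerate(profiles, n_clusters, linkage="average"):
    """Merge the closest clusters until n_clusters subpopulations remain.

    profiles maps an organism label to its list of cellular constituent
    measurements.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    combine = LINKAGE[linkage]
    clusters = [[name] for name in sorted(profiles)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                distances = [
                    euclidean(profiles[a], profiles[b])
                    for a in clusters[i]
                    for b in clusters[j]
                ]
                d = combine(distances)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return [sorted(members) for members in clusters]


# Hypothetical organisms with two measurements each.
profiles = {"m1": [0.0, 0.0], "m2": [0.0, 1.0], "m3": [10.0, 10.0], "m4": [10.0, 11.0]}
print(agglomerate(profiles, 2, linkage="nearest"))
```

Each resulting subpopulation could then be passed separately to the quantitative genetic analysis step.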

Another aspect of the present invention provides a computer program product for
use in conjunction with a computer system. The computer program product comprises a
computer readable storage medium and a computer program mechanism embedded
therein. The computer program mechanism comprises a classification module, a genetic
analysis module, and an ortholog identification module. The classification module is for
dividing a plurality of organisms of a second species into a plurality of subpopulations
using a classification scheme that classifies each organism in the plurality of organisms of
the second species into at least one of the subpopulations. The classification scheme uses
a plurality of cellular constituent measurements from each organism in the second
species. The genetic analysis module is used, for at least one subpopulation in the
plurality of subpopulations, to perform quantitative genetic analysis on the subpopulation
in order to identify a quantitative trait locus for a complex trait that is exhibited by the
second species and a first species. The ortholog identification module is used for finding
the quantitative trait locus in the first species that is the ortholog of the quantitative trait
locus of the second species.


Another aspect of the present invention provides a computer system for
identifying a quantitative trait locus for a complex trait in a first species. The complex
trait is exhibited by the first species and a second species. The computer system
comprises a central processing unit and a memory coupled to the central processing unit.
The memory is used for storing a classification module, a genetic analysis module, and an
ortholog identification module. The classification module includes instructions for
dividing a plurality of organisms of a second species into a plurality of subpopulations
using a classification scheme that classifies each organism in the plurality of organisms of
the second species into at least one of the subpopulations. The classification scheme uses
a plurality of cellular constituent measurements from each organism in the second
species. The genetic analysis module includes instructions that, for at least one
subpopulation in the plurality of subpopulations, perform quantitative genetic analysis
on the subpopulation in order to identify the quantitative trait locus for the complex trait.
The ortholog identification module comprises instructions for finding the quantitative trait
locus in the first species that is the ortholog of the quantitative trait locus in the second
species.



3.4. OBESITY RELATED GENES AND OBESITY RELATED GENE PRODUCTS

Another aspect of the present invention provides a method for determining
whether a candidate molecule affects a body weight disorder associated with an organism.
The method comprises the step of (a) contacting a cell from the organism with, or
recombinantly expressing within the cell from the organism, the candidate molecule. The
method further comprises the step of (b) determining whether the RNA expression or
protein expression in the cell of at least one open reading frame is changed in step (a)
relative to the expression of the open reading frame in the absence of the candidate
molecule. Each open reading frame is regulated by a promoter native to a nucleic acid
sequence selected from the group consisting of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID
NO: 3, SEQ ID NO: 9, SEQ ID NO: 12, SEQ ID NO: 13, SEQ ID NO: 16, SEQ ID NO:
19, SEQ ID NO: 20, and homologs of each of the foregoing. The method further
comprises the step of (c) determining that the candidate molecule affects a body weight
disorder associated with the organism when the RNA expression or protein expression of
the at least one open reading frame is changed, or determining that the candidate
molecule does not affect a body weight disorder associated with the organism when the
RNA expression or protein expression of the at least one open reading frame is
unchanged. In some embodiments, the body weight disorder is obesity, anorexia nervosa,
bulimia nervosa, or cachexia.

In some embodiments, the cell from the organism that is contacted with the
candidate molecule exhibits a lower expression level of a protein sequence selected from
the group consisting of SEQ ID NO: 4, SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7,
SEQ ID NO: 8, SEQ ID NO: 10, SEQ ID NO: 11, SEQ ID NO: 14, SEQ ID NO: 15, SEQ
ID NO: 17, SEQ ID NO: 18, SEQ ID NO: 21, SEQ ID NO: 22, SEQ ID NO: 23, SEQ ID
NO: 24, SEQ ID NO: 25, SEQ ID NO: 26, SEQ ID NO: 27, SEQ ID NO: 28, and SEQ ID
NO: 29 relative to a cell from the organism that is not contacted with the candidate
molecule. In some embodiments, step (b) comprises determining whether RNA
expression is changed. In some embodiments, step (b) comprises determining whether
protein expression is changed. In some embodiments, step (b) comprises determining
whether RNA or protein expression of at least two of the open reading frames is changed.
In some embodiments, step (a) comprises contacting the cell with the candidate molecule
and step (a) is carried out in a liquid high throughput-like assay.

In some embodiments, the cell comprises a promoter region of at least one gene
selected from the group consisting of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3,
SEQ ID NO: 9, SEQ ID NO: 12, SEQ ID NO: 13, SEQ ID NO: 16, SEQ ID NO: 19, SEQ
ID NO: 20, and homologs of each of the foregoing, each promoter region being operably
linked to a marker gene. Further, step (b) comprises determining whether the RNA
expression or protein expression of the marker gene(s) is changed in step (a) relative to
the expression of the marker gene in the absence of the candidate molecule. Illustrative
marker genes include, but are not limited to, green fluorescent protein, red fluorescent
protein, blue fluorescent protein, luciferase, LEU2, LYS2, ADE2, TRP1, CAN1, CYH2,
GUS, CUP1 and chloramphenicol acetyl transferase.

Another aspect of the invention provides a method of identifying a molecule that
specifically binds to a ligand selected from the group consisting of (i) a protein encoded
by a gene selected from the group consisting of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID
NO: 3, SEQ ID NO: 9, SEQ ID NO: 12, SEQ ID NO: 13, SEQ ID NO: 16, SEQ ID NO:
19, SEQ ID NO: 20, and homologs of each of the foregoing, and (ii) a biologically active
fragment of SEQ ID NO: 4, SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, SEQ ID NO:
8, SEQ ID NO: 10, SEQ ID NO: 11, SEQ ID NO: 14, SEQ ID NO: 15, SEQ ID NO: 17,
SEQ ID NO: 18, SEQ ID NO: 21, SEQ ID NO: 22, SEQ ID NO: 23, SEQ ID NO: 24,
SEQ ID NO: 25, SEQ ID NO: 26, SEQ ID NO: 27, SEQ ID NO: 28, and SEQ ID NO: 29.
The method comprises the step of (a) contacting the ligand with one or more candidate
molecules under conditions conducive to binding between the ligand and the candidate
molecules. The method further comprises the step of (b) identifying a molecule within
the one or more candidate molecules that binds to the ligand.




Another aspect of the invention provides a purified protein comprising the amino
acid sequence of SEQ ID NO: 8. Still another aspect of the invention provides a purified
protein encoded by a nucleic acid hybridizable to a DNA having a sequence consisting of
the coding region of SEQ ID NO: 2. Yet another aspect of the invention provides a
purified protein comprising an amino acid sequence that has at least 90% identity to the
amino acid sequence set forth in SEQ ID NO: 8, in which percentage identity is
determined over an amino acid sequence of identical size as SEQ ID NO: 8. Still another
aspect of the invention provides a purified protein comprising an amino acid sequence
that has at least 95% identity to the amino acid sequence set forth in SEQ ID NO: 8, in
which percentage identity is determined over an amino acid sequence of identical size as
SEQ ID NO: 8.

Another embodiment of the present invention provides an isolated nucleic acid
comprising the nucleotide sequence of SEQ ID NO: 2, a coding region of SEQ ID NO: 2,
SEQ ID NO: 3, a coding region of SEQ ID NO: 3, or the complement of any of the
foregoing. In some embodiments, this nucleic acid is DNA. Another embodiment of the
present invention provides an antibody that binds to a protein consisting of the amino acid
sequence of SEQ ID NO: 8. This antibody may be monoclonal. Another embodiment of
the present invention provides a molecule comprising a fragment of the antibody of claim
258, which fragment binds a protein consisting of the amino acid sequence of SEQ ID
NO: 8.

Another embodiment of the present invention provides a method of treating or
preventing a body weight disorder. The method comprises administering to a subject in
which treatment is desired a therapeutically effective amount of a molecule that inhibits a
function of one or more of the group consisting of SEQ ID NO: 8, SEQ ID NO: 11, SEQ
ID NO: 15, SEQ ID NO: 18, SEQ ID NO: 25, SEQ ID NO: 26, and SEQ ID NO: 27. In
some embodiments, the subject is human. In some embodiments, the molecule that
inhibits a function of one or more of the group consisting of SEQ ID NO: 8, SEQ ID NO:
11, SEQ ID NO: 15, SEQ ID NO: 18, SEQ ID NO: 25, SEQ ID NO: 26, and SEQ ID NO:
27 is selected from the group consisting of an antibody that binds to one of SEQ ID NO:
8, SEQ ID NO: 11, SEQ ID NO: 15, SEQ ID NO: 18, SEQ ID NO: 25, SEQ ID NO: 26,
and SEQ ID NO: 27 or a fragment or derivative thereof containing the binding region
thereof, and a nucleic acid complementary to the RNA produced by transcription of a gene
encoding one of SEQ ID NO: 8, SEQ ID NO: 11, SEQ ID NO: 15, SEQ ID NO: 18, SEQ
ID NO: 25, SEQ ID NO: 26, and SEQ ID NO: 27. In some embodiments, the molecule
that inhibits a function of one or more of the group consisting of SEQ ID NO: 8, SEQ ID
NO: 11, SEQ ID NO: 15, SEQ ID NO: 18, SEQ ID NO: 25, SEQ ID NO: 26, and SEQ ID
NO: 27 is an oligonucleotide that (a) consists of at least six nucleotides; (b) comprises a
sequence complementary to at least a portion of an RNA transcript of a gene encoding
one of SEQ ID NO: 8, SEQ ID NO: 11, SEQ ID NO: 15, SEQ ID NO: 18, SEQ ID NO:
25, SEQ ID NO: 26 or SEQ ID NO: 27; and (c) is hybridizable to the RNA transcript
under moderately stringent conditions.

Yet another aspect of the present invention provides a method of treating or
preventing a body weight disorder. The method comprises administering to a subject in
which treatment is desired a therapeutically effective amount of a molecule that enhances
a function of one or more of the group consisting of SEQ ID NO: 8, SEQ ID NO: 11,
SEQ ID NO: 15, SEQ ID NO: 18, SEQ ID NO: 25, SEQ ID NO: 26, and SEQ ID NO: 27.
In some embodiments the subject is human.

Still another aspect of the present invention provides a method of diagnosing a
disease or disorder or the predisposition to the disease or disorder. The disease or
disorder is characterized by an aberrant level of one of SEQ ID NO: 1 through SEQ ID
NO: 29 in a subject. The method comprises measuring the level of any one of SEQ ID
NO: 1 through SEQ ID NO: 29 in a sample derived from the subject, in which an increase
or decrease in the level of one of SEQ ID NO: 1 through SEQ ID NO: 29 in the sample,
relative to the level of one of the SEQ ID NO: 1 through SEQ ID NO: 29 found in an
analogous sample not having the disease or disorder, indicates the presence of the disease
or disorder in the subject. In some embodiments the disease or disorder is a body weight
disorder. In some embodiments, the body weight disorder is obesity, anorexia nervosa,
bulimia nervosa, or cachexia.

Another aspect of the present invention provides a method of diagnosing or
screening for the presence of or predisposition for developing a disease or disorder
involving a body weight disorder in a subject. The method comprises detecting one or
more mutations in at least one of SEQ ID NO: 1 through SEQ ID NO: 29 in a sample
derived from the subject, in which the presence of the one or more mutations indicates the
presence of the disease or disorder or a predisposition for developing the disease or
disorder.

Still another aspect of the present invention provides a recombinant non-human
animal that is the product of a process comprising introducing a nucleic acid encoding at
least a domain of one of SEQ ID NO: 8, SEQ ID NO: 15, SEQ ID NO: 18, SEQ ID NO:
25, SEQ ID NO: 26, and SEQ ID NO: 27 into the recombinant non-human animal.



3.5. USING CROSS SPECIES DATA TO ASSOCIATE GENES WITH TRAITS
OF INTEREST

One embodiment of the present invention provides a method for confirming the
association of a query QTL or a query gene in the genome of a second species with a
clinical trait T exhibited by the second species. The method comprises (a) finding a first
QTL or a first gene in a first species that is linked to a trait T', wherein trait T' is
indicative of trait T; (b) mapping a region of the genome of the first species that
comprises the first QTL or the first gene to a region of the genome of the second species;
and (c) finding a query QTL or a query gene in the second species that is potentially
associated with the trait T, wherein the potential association of the query QTL or the
query gene with the clinical trait T is confirmed when the query QTL or the query gene is
in the region of the genome of the second species.
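Steps (b) and (c) of the method above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the Region and SyntenyBlock types, the function names, and the use of whole synteny blocks (rather than interpolated coordinates) are hypothetical choices made for the example and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Region:
    """A genomic interval; units (e.g. Mb or cM) are up to the caller."""
    chrom: str
    start: float
    end: float


@dataclass(frozen=True)
class SyntenyBlock:
    """Hypothetical record pairing a first-species block with a second-species block."""
    first: Region
    second: Region


def overlaps(a: Region, b: Region) -> bool:
    """True when two intervals lie on the same chromosome and intersect."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def map_region(region: Region, blocks: List[SyntenyBlock]) -> List[Region]:
    """Step (b): second-species regions whose synteny block overlaps `region`.

    Assumption: the whole second-species block is returned, which is coarser
    than a base-by-base alignment would be.
    """
    return [blk.second for blk in blocks if overlaps(region, blk.first)]


def is_confirmed(query: Region, first_qtl: Region,
                 blocks: List[SyntenyBlock]) -> bool:
    """Step (c): the query QTL is confirmed when it falls in a mapped region."""
    return any(overlaps(query, mapped)
               for mapped in map_region(first_qtl, blocks))
```

A caller would supply a synteny map for the two species and pass the first-species QTL interval together with the candidate interval in the second species; a True result corresponds to the confirmation condition stated in step (c).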

In some embodiments, the finding step (a) comprises (i) crossing a first strain and
a second strain of the first species in order to obtain a segregating population; (ii)
stratifying the segregating population into a plurality of subpopulations, wherein a
subpopulation in the plurality of subpopulations represents a phenotypic extreme of the
trait T'; (iii) using cellular constituent measurements from organisms in the plurality of
subpopulations to identify a cellular constituent set that exhibits a cellular constituent
measurement pattern associated with the phenotypic extreme; (iv) clustering the
segregating population based on measurements of the cellular constituent set in organisms
in the segregating population to obtain a plurality of population clusters; and (v) for at
least one population cluster in the plurality of population clusters, performing
quantitative genetic analysis on the population cluster in order to find the first QTL or the
first gene in the first species that is linked to the trait T'.
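Steps (ii) through (iv) above lend themselves to a short illustration. The Python sketch below is an assumption-laden example rather than the method of the specification: the quartile cut-off, the Welch t statistic used to pick the cellular constituent set, the threshold of 3.0, and the two-cluster k-means are all hypothetical choices, and step (v), the quantitative genetic analysis itself, is not shown.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def stratify_extremes(trait: Dict[str, float],
                      fraction: float = 0.25) -> Tuple[List[str], List[str]]:
    """Step (ii): organism IDs in the lowest and highest tails of the trait."""
    ranked = sorted(trait, key=trait.get)
    k = max(2, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]


def discriminating_set(expr: Dict[str, Dict[str, float]],
                       low: List[str], high: List[str],
                       min_t: float = 3.0) -> List[str]:
    """Step (iii): constituents whose Welch t statistic separates the extremes."""
    selected = []
    for gene, values in expr.items():
        a = [values[o] for o in low]
        b = [values[o] for o in high]
        se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
        if se > 0 and abs(mean(a) - mean(b)) / se >= min_t:
            selected.append(gene)
    return selected


def sq_dist(u: List[float], v: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(u, v))


def two_means(points: Dict[str, List[float]],
              iterations: int = 20) -> Tuple[List[str], List[str]]:
    """Step (iv): split the whole population into two clusters (k-means, k=2).

    Deterministic start: the first organism and the organism farthest from it.
    """
    ids = sorted(points)
    c0 = points[ids[0]]
    c1 = points[max(ids, key=lambda i: sq_dist(points[i], c0))]
    groups: Tuple[List[str], List[str]] = ([], [])
    for _ in range(iterations):
        groups = ([], [])
        for i in ids:
            # bool index: True (closer to c1) selects the second group
            groups[sq_dist(points[i], c1) < sq_dist(points[i], c0)].append(i)
        if not groups[0] or not groups[1]:
            break
        c0, c1 = ([mean(col) for col in zip(*(points[i] for i in g))]
                  for g in groups)
    return groups
```

Each resulting cluster would then be passed separately to whatever QTL analysis is used for step (v), so that linkage is assessed within a more homogeneous subgroup.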



3.6. USING CAUSALITY IN A FIRST SPECIES TO ASSOCIATE GENES OR
LOCI WITH TRAITS OF INTEREST IN A SECOND SPECIES

One aspect of the present invention provides a method of identifying a molecular
target for a second trait in a second species. In the method a first gene in a segregating
population that is causal for a first trait exhibited by all or a portion of the segregating
population is identified. Each member of the segregating population is a member of a
first species and the second trait in the second species corresponds to the first trait in the
first species. The first gene in the first species is mapped to a corresponding locus in the
genome of the second species. Next, a determination is made as to whether a marker or a
haplotype in the corresponding locus in the genome of the second species associates with
the second trait. When the marker or the haplotype associates with the second trait in the
second species, the locus is identified as the molecular target.
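The determination of whether a marker associates with the second trait can be illustrated with a simple case-control comparison. The Python sketch below assumes, purely for illustration, a biallelic marker, a binary trait, and a Pearson chi-square test on a 2x2 table of allele counts; the specification does not prescribe this particular test, and the function names and the significance level are hypothetical.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic (1 d.f.) for the table [[a, b], [c, d]].

    Rows: affected / unaffected individuals; columns: allele 1 / allele 2 counts.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denom


def p_value_1df(chi2: float) -> float:
    """Upper-tail probability of a chi-square variate with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def marker_associates(a: int, b: int, c: int, d: int,
                      alpha: float = 0.05) -> bool:
    """Hypothetical decision rule: call an association when p < alpha."""
    return p_value_1df(chi_square_2x2(a, b, c, d)) < alpha
```

In practice the threshold would need to account for the number of markers tested in the corresponding locus, and haplotype tests would require a larger contingency table than the one shown here.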

In some embodiments, the marker or the haplotype is in a second gene in the
corresponding locus and the second gene is identified as the molecular target. In some
embodiments, the first gene and the second gene are orthologous. In some embodiments,
the first gene in the segregating population that is causal for the first trait exhibited by all
or a portion of the segregating population is identified by a method that comprises: (a)
identifying a test gene in the first species that has at least one abundance quantitative trait
locus (eQTL) coincident with a respective clinical quantitative trait locus (cQTL) for the
first trait; and (b) testing, for one or more respective eQTL in the at least one eQTL,
whether (i) the genetic variation of the eQTL across the segregating population and (ii)
the variation of the first trait across the segregating population are correlated conditional
on an abundance pattern of the test gene across the segregating population. When the
genetic variation of (1) one or more respective eQTL tested in step (b) and (2) the
variation of the first trait across the segregating population are correlated conditional on
an abundance pattern of the test gene across the segregating population, the test gene is
identified as the first gene. In some embodiments, the second species is mammalian. In
particular, in some embodiments, the second species is human. In some embodiments,
the second trait is asthma, ataxia telangiectasia, bipolar disorder, cancer, common late-onset
Alzheimer's disease, diabetes, heart disease, hereditary early-onset Alzheimer's
disease, hereditary nonpolyposis colon cancer, hypertension, infection, maturity-onset
diabetes of the young, mellitus, migraine, nonalcoholic fatty liver, nonalcoholic
steatohepatitis, non-insulin-dependent diabetes mellitus, obesity, polycystic kidney
disease, psoriases, schizophrenia, or xeroderma pigmentosum. In some embodiments, the
molecular target is a gene, an exon, an intron, or a regulatory element of a gene. In some
embodiments the marker is a single nucleotide polymorphism, a microsatellite marker, a
restriction fragment length polymorphism, a short tandem repeat, a DNA methylation
marker, a sequence length polymorphism, a random amplified polymorphic DNA, an
amplified fragment length polymorphism, or a simple sequence repeat.
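The conditional test of step (b) above relates three quantities measured across the segregating population: genotype at the eQTL, the first trait, and the abundance of the test gene. One way to quantify a correlation "conditional on" the abundance pattern is a first-order partial correlation, sketched below in Python. The choice of partial correlation, the numeric coding of genotypes, and the function names are assumptions made for this example; the passage above does not name a specific statistic.

```python
from statistics import mean
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def partial_corr(genotype: Sequence[float],
                 trait: Sequence[float],
                 abundance: Sequence[float]) -> float:
    """Correlation of genotype and trait after conditioning on abundance.

    Standard first-order partial correlation:
        (r_gt - r_ge * r_te) / sqrt((1 - r_ge^2) * (1 - r_te^2))
    Undefined (division by zero) when abundance is perfectly correlated
    with either genotype or trait.
    """
    r_gt = pearson(genotype, trait)
    r_ge = pearson(genotype, abundance)
    r_te = pearson(trait, abundance)
    return (r_gt - r_ge * r_te) / ((1 - r_ge ** 2) * (1 - r_te ** 2)) ** 0.5
```

Comparing pearson(genotype, trait) with partial_corr(genotype, trait, abundance) shows how much of the genotype-trait relationship remains once the abundance pattern is taken into account; how that comparison is turned into a decision is left to whichever statistical test an implementation adopts.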




Another aspect of the present invention provides a method of identifying a
molecular target for a second trait in a second species. In the method, a first gene in a
segregating population that is causal for a first trait exhibited by all or a portion of the
segregating population is identified. Here, each member of the segregating population is
a member of a first species and the second trait in the second species corresponds to the
first trait in the first species. A locus in the genome of the second species that is (1)
linked to the second trait and (2) maps to the position in the genome of the first species
where the first gene resides is identified. Finally, a determination is made as to whether a
marker or a haplotype in the corresponding locus in the genome of the second species
associates with the second trait. When the marker or the haplotype associates with the
second trait in the second species, the locus is identified as the molecular target. In some
embodiments, the marker or the haplotype is in a second gene in the corresponding locus
and the second gene is identified as the molecular target. In some embodiments, the first
gene and the second gene are orthologous. In some embodiments, the method of
identifying the first gene in the segregating population that is causal for the first trait
exhibited by all or a portion of the segregating population comprises: (a) identifying a test
gene in the first species that has at least one abundance quantitative trait locus (eQTL)
coincident with a respective clinical quantitative trait locus (cQTL) for the first trait; and
(b) testing, for one or more respective eQTL in the at least one eQTL, whether (i) the
genetic variation of the eQTL across the segregating population and (ii) the variation of
the first trait across the segregating population are correlated conditional on an abundance
pattern of the test gene across the segregating population. When the genetic variation of
(1) the one or more respective eQTL tested in step (b) and (2) the variation of the first
trait across the segregating population are correlated conditional on an abundance pattern
of the test gene across the segregating population, the test gene is identified as the first
gene.

Another aspect of the invention provides a method of identifying a molecular
target for a second trait in a second species. In the method, a first gene in a segregating
population that is causal for a first trait exhibited by all or a portion of the segregating
population is identified. Here, each member of the segregating population is a member of
a first species and the second trait in the second species corresponds to the first trait in the
first species. A second gene in the genome of the second species that is orthologous to
the first gene is identified such that (i) the variation of the abundance of the second gene
across biological samples taken from a plurality of members of the second species and (ii)
the variation of the second trait across the plurality of members of the second species are
associated. This second gene is the sought after molecular target. In some embodiments,
the method further comprises validating the second gene by determining whether a
marker or a haplotype in the second gene associates with the second trait. When the
marker or the haplotype associates with the second trait in the second species, the second
gene is validated. In some instances, the first gene in a segregating population that is
causal for a first trait exhibited by all or a portion of the segregating population is
identified by (a) finding a test gene in the first species that has at least one abundance
quantitative trait locus (eQTL) coincident with a respective clinical quantitative trait locus
(cQTL) for the first trait; and (b) testing, for one or more respective eQTL in the at least
one eQTL, whether (i) the genetic variation of the eQTL across the segregating
population and (ii) the variation of the first trait across the segregating population are
correlated conditional on an abundance pattern of the test gene across the segregating
population. When the genetic variation of (1) the one or more respective eQTL tested in
step (b) and (2) the variation of the first trait across the segregating population are
correlated conditional on an abundance pattern of the test gene across the segregating
population, the test gene is identified as the first gene.
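The requirement that the abundance of the second gene and the second trait be "associated" across members of the second species can be checked in many ways. The Python sketch below uses a Spearman rank correlation with a permutation p-value; this choice, the number of permutations, the fixed seed, and the function names are assumptions made for illustration only and are not drawn from the specification.

```python
import random
from statistics import mean
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5  # fails if either input is constant


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def permutation_p(abundance: Sequence[float], trait: Sequence[float],
                  n_perm: int = 2000, seed: int = 0) -> float:
    """Two-sided permutation p-value for the abundance-trait rank correlation."""
    rng = random.Random(seed)
    observed = abs(spearman(abundance, trait))
    shuffled = list(trait)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman(abundance, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

An ortholog whose abundance yields a small permutation p-value against the second trait would be a candidate for the marker or haplotype validation described in the paragraph above.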

Still another aspect of the present invention provides a computer system for
identifying a molecular target for a second trait in a second species. The computer system
comprises a central processing unit and a memory coupled to the central processing unit.
The memory stores instructions for identifying a first gene in a segregating population
that is causal for a first trait exhibited by all or a portion of the segregating population.
Each member of the segregating population is a member of a first species and the second
trait in the second species corresponds to the first trait in the first species. The memory
further includes instructions for mapping the first gene in the first species to a
corresponding locus in the genome of the second species. The memory further includes
instructions for determining whether a marker or a haplotype in the corresponding locus
in the genome of the second species associates with the second trait.

Yet another aspect of the present invention provides a computer program product
for use in conjunction with a computer system, the computer program product comprising
a computer readable storage medium and a computer program mechanism embedded
therein. The computer program mechanism comprises instructions for identifying a first
gene in a segregating population that is causal for a first trait exhibited by all or a portion
of the segregating population. Each member of the segregating population is a member
of a first species and the second trait in the second species corresponds to the first trait in
the first species. The computer program mechanism further includes instructions for
mapping the first gene in the first species to a corresponding locus in the genome of the
second species. The computer program mechanism also includes instructions for
determining whether a marker or a haplotype in the corresponding locus in the genome of
the second species associates with the second trait.

Still another embodiment of the present invention provides a computer program
product for use in conjunction with a computer system, the computer program product
comprising a computer readable storage medium and a computer program mechanism
embedded therein. The computer program mechanism comprises instructions for
identifying a first gene in a segregating population that is causal for a first trait exhibited
by all or a portion of the segregating population. Each member of the segregating
population is a member of a first species and the second trait in the second species
corresponds to the first trait in the first species. The computer program mechanism
further includes instructions for identifying a locus in the genome of the second species
that is (1) linked to the second trait and (2) maps to the position in the genome of the first
species where the first gene resides. The computer program mechanism further
comprises instructions for determining whether a marker or a haplotype in the
corresponding locus in the genome of the second species associates with the second trait.

Yet another embodiment of the invention provides a computer system for identifying a
molecular target for a second trait in a second species. The computer system comprises a
central processing unit and a memory coupled to the central processing unit. The
memory stores instructions for identifying a first gene in a segregating population that is
causal for a first trait exhibited by all or a portion of the segregating population. Each
member of the segregating population is a member of a first species. Further, the second
trait in the second species corresponds to the first trait in the first species. The memory
further stores instructions for identifying a second gene in the genome of the second
species that is orthologous to the first gene such that (i) the variation of the abundance of
the second gene across biological samples taken from a plurality of members of the
second species and (ii) the variation of the second trait across the plurality of members of
the second species are associated.

Still another aspect of the invention provides a computer program product for use
in conjunction with a computer system. The computer program product comprises a
computer readable storage medium and a computer program mechanism embedded
therein. The computer program mechanism comprises instructions for identifying a first
gene in a segregating population that is causal for a first trait exhibited by all or a portion
of the segregating population. Each member of the segregating population is a member
of a first species. The second trait in the second species corresponds to the first trait in
the first species. The computer program mechanism further includes instructions for
identifying a second gene in the genome of the second species that is orthologous to the
first gene such that (i) the variation of the abundance of the second gene across biological
samples taken from a plurality of members of the second species and (ii) the variation of
the second trait across the plurality of members of the second species are associated.



4. BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 illustrates a computer system for associating a gene with a trait exhibited by
one or more organisms in a plurality of organisms in accordance with one embodiment of
the present invention.

Fig. 2 illustrates processing steps for associating a gene with a trait exhibited by
one or more organisms in a plurality of organisms of a single species using a clustering
approach, in accordance with an embodiment of the present invention.

Fig. 3A illustrates an expression / genotype warehouse in accordance with one
embodiment of the present invention.

Fig. 3B illustrates a gene expression statistic found in an expression / genotype
warehouse in accordance with one embodiment of the present invention.

Fig. 3C illustrates an expression / genotype warehouse in accordance with another
embodiment of the present invention.

Fig. 4 illustrates a quantitative trait locus results database in accordance with one
embodiment of the present invention.

Fig. 5 illustrates genetic crosses used to derive a mouse model for a complex
human disease in accordance with one embodiment of the present invention.

Fig. 6 provides a histogram for p-values of segregation analyses performed on
2,126 genes across four CEPH families in accordance with one embodiment of the
present invention.




Fig. 7 illustrates expression quantitative trait loci ("eQTL") identified for a
diversity of transcript abundance polymorphisms in accordance with one embodiment of
the present invention.

Fig. 8 highlights a range of gene-centered polymorphisms known to exist between
DBA and B6 mouse strains, in accordance with one embodiment of the present invention.

Fig. 9 illustrates how quantitative trait loci analysis using gene expression as a
quantitative trait can detect a quantitative trait locus for a gene that has a higher copy
number in one parent than the other, in accordance with one embodiment of the present
invention.

Fig. 10 illustrates how the use of expression data as a quantitative trait can detect
differential splicing, in accordance with one embodiment of the present invention.

Fig. 11 illustrates the pathways associated with nicotinate and nicotinamide
metabolism in accordance with the prior art.

Fig. 12 provides a key for important enzymes in the pathways associated with
nicotinate and nicotinamide metabolism that are illustrated in Fig. 11.

Fig. 13 illustrates how the use of expression data as a quantitative trait can detect
nonsense mutations, in accordance with one embodiment of the present invention.

Fig. 14 illustrates the results of a QTL analysis in a region of mouse chromosome
11 for the phenotypic traits "free fatty acid" (curve 1402) and "triglyceride level" (curve
1404), in accordance with one embodiment of the present invention.

Fig. 15 illustrates expression QTL ("eQTL") from several genes that are known to
be involved with glucose and lipid metabolism which overlap with the "free fatty acid"
and "triglyceride level" clinical trait QTL ("cQTL") on chromosome 11, in accordance
with one embodiment of the present invention.

Fig. 16 shows a scatter plot that breaks down the mean log ratios for the mouse
peroxisome proliferator activated receptor (PPAR) binding protein by mouse genotype at
the chromosome 11 location across the F2 mouse population (120 F2 mouse livers) that
was profiled in accordance with one embodiment of the present invention.

Fig. 17 shows a scatter plot that breaks down the mean log ratios for the mouse
PPAR binding protein by mouse genotype at the chromosome 15 location across the F2
mouse population (120 F2 mouse livers) that was profiled in accordance with one
embodiment of the present invention.

Fig. 18 is a plot that illustrates how genes known to be involved in lipid
metabolism are linked by eQTL analysis to the same genetic locus, even though they
physically reside at different unlinked locations.

Fig. 19 illustrates processing steps for associating a gene G in the genome of a
single species with a clinical trait T that is exhibited by one or more organisms in a
plurality of organisms of the single species, in accordance with an embodiment of the
present invention.

Fig. 20 illustrates clinical quantitative trait loci (cQTL) for four mouse obesity-
related traits that co-localize with the expression QTL (eQTL) for four genes at a QTL
hot spot on mouse chromosome 2, in accordance with an embodiment of the present
invention.

Fig. 21 illustrates a plurality of phenotypic statistics sets, in accordance with an
embodiment of the present invention.

Fig. 22 illustrates computing modules in accordance with an embodiment of the
present invention.

Fig. 23 illustrates the hierarchical clustering of 123 genes that are linked to a
particular chromosome 2 locus or are highly correlated with genes that are linked to this
locus (x-axis), against the hierarchical clustering of F2 mice in the highest and lowest
quartile for the phenotype "subcutaneous fat pad mass" (y-axis), in accordance with one
embodiment of the present invention.

Fig. 24 illustrates a hypothetical example in which a biological pathway that
affects the complex trait obesity is deduced, in accordance with one embodiment of the
present invention.

Fig. 25 illustrates sequence-based processing steps for identifying an ortholog in a
reference species to a gene associated with a complex trait in a target species in
accordance with one embodiment of the present invention.

Fig. 26 illustrates nonsequence-based processing steps for identifying an ortholog
in a reference species to a gene associated with a complex trait in a target species in
accordance with one embodiment of the present invention.




Fig. 27 illustrates the nucleotide sequence of the Mus musculus gene NM_025575
(SEQ ID NO: 1), also known as the 2610042O14Rik gene.

Fig. 28 illustrates the human mRNA that corresponds to the Mus musculus gene
NM_025575, a corrected form of AL591714.1 (SEQ ID NO: 2).

Fig. 29 illustrates the portion of SEQ ID NO: 2 that codes for the human protein
that corresponds to Mus musculus NP_079851 (SEQ ID NO: 3).

Figs. 30A through 30D illustrate four Mus musculus amino acid sequences that
correspond to the 2610042O14Rik gene, namely Q9CQK0 (SEQ ID NO: 4), Q9CYM5
(SEQ ID NO: 5), Q9CYX5 (SEQ ID NO: 6), and Q9DAU8 (SEQ ID NO: 7).

Fig. 31 illustrates the human amino acid sequence that corresponds to the
corrected form of accession number AL491714 (Fig. 28) (SEQ ID NO: 8).

Fig. 32 illustrates the nucleotide sequence for the Mus musculus gene NM_015731
(SEQ ID NO: 9), which is also known as ATP9A.

Fig. 33 illustrates the human relationships field of a LocusLink query of ATP9A,
indicating a human chromosomal location of 20q13.11-13.2.

Fig. 34 illustrates the nucleotide and protein sequence associated with human
chromosome position 20q13.11-13.2.

Fig. 35 illustrates the amino acid sequence of the Mus musculus protein
NM_015731 (SEQ ID NO: 10) that corresponds to the Mus musculus gene NM_015731.

Fig. 36 illustrates the human ortholog to the Mus musculus protein NM_015731,
namely O75110 (SEQ ID NO: 11).

Fig. 37 illustrates the inferred mRNA sequence for O75110 (SEQ ID NO: 12).

Fig. 38 illustrates the Mus musculus gene NM_025996 (SEQ ID NO: 13).

Figs. 39A and 39B illustrate the LocusLink record for the Mus musculus gene
NM_025996 (SEQ ID NO: 13).

Fig. 40 illustrates the mouse protein that corresponds to the Mus musculus gene
NM_025996 (SEQ ID NO: 13), which is NP_080272 (SEQ ID NO: 14).

Fig. 41 illustrates the human protein NP_006800 (SEQ ID NO: 15) which
corresponds to the mouse protein NP_080272 (SEQ ID NO: 14).




Fig. 42 illustrates the human nucleotide sequence NM_006809.2 (SEQ ID NO:
16), which corresponds to the human protein NP_006800 (SEQ ID NO: 15).

Fig. 43 illustrates the Mus musculus protein sequence TR_Q9CYG7 (SEQ ID NO:
17), which is a TrEMBL database entry that is nearly identical to the Mus musculus
protein NP_080272 (SEQ ID NO: 14).

Fig. 44 illustrates the human protein TR_Q15785 (SEQ ID NO: 18) which
corresponds to the human nucleotide sequence NM_006809 (SEQ ID NO: 16).

Fig. 45 illustrates the nucleotide sequence for the Mus musculus gene NM_009564
(SEQ ID NO: 19).

Fig. 46 illustrates the human nucleotide sequence NM_018197.1 (SEQ ID NO:
20), which corresponds to the Mus musculus gene NM_009564 (SEQ ID NO: 19).

Fig. 47 illustrates the Mus musculus amino acid sequence NP_033590 (SEQ ID
NO: 21), which corresponds to the Mus musculus gene NM_009564 (SEQ ID NO: 19).

Fig. 48 illustrates the Mus musculus TrEMBL database entry TR_P97365 (SEQ
ID NO: 22), which corresponds to the Mus musculus gene NM_009564 (SEQ ID NO:
19).

Fig. 49 illustrates the Mus musculus TrEMBL database entry TR_Q99KE8 (SEQ
ID NO: 23), which corresponds to the Mus musculus gene NM_009564 (SEQ ID NO:
19).

Fig. 50 illustrates the Mus musculus TrEMBL database entry TR_Q9CWR3 (SEQ
ID NO: 24), which corresponds to the Mus musculus gene NM_009564 (SEQ ID NO:
19).

Fig. 51 illustrates the human amino acid sequence NP_060667 (SEQ ID NO: 25)
that corresponds to the human nucleotide sequence NM_018197 (SEQ ID NO: 20).

Fig. 52 illustrates the TrEMBL database entry TR_Q9NPA5 (SEQ ID NO: 26)
that also corresponds to the Mus musculus gene NM_009564 (SEQ ID NO: 19).

Fig. 53 illustrates the TrEMBL database entry TR_Q9NTS7 (SEQ ID NO: 27)
that also corresponds to the Mus musculus gene NM_009564 (SEQ ID NO: 19).

Fig. 54 illustrates an embodiment of a QTL analysis module in accordance with
the present invention.




Fig. 55 illustrates an experimental protocol in which two inbred strains that 
exhibit polymorphic behavior with respect to a trait of interest are crossed to obtain an F2, 
back-cross or other such segregating population in order to gisnerate a population where 
flie trait of interest is segregating, in accordance with one embodiment of the present 
5 invention. 

Fig. 56 illustrates an experimental protocol in which a segregating population is 
genotyped based upon a marker map, scored with respect to phenotypes associated with 
the trait of interest, and tissues relevant to the trait under study are isolated from samples 
of the segregating population for expression profiling, in accordance with some . 
10 embodiments of the present invention. 

Fig. 57 illustrates the identification of those subgroups within a whole population 
that exhibit different trait subtypes, in accordance with one embodiment of the present 
invention. 

Fig. 58. illustrates the identification of subtypes within a population that are most 
15 extreme with respect to the trait under study using microarray expression data in 
' accordance with one embodiment of the present invention. ■ 

Fig. 59 is a depiction of a two-dimensional cluster of the most differentially 
expressed set of genes in mice comprising the upper and lower 25* percentiies of the 
subcutaneous fat pad mass (FPM) trait in a segregating population, in accordance with 
20 one embodiment of the present invention. 

Figs. 60 and 61 illustrates that one subgroup of mice is not under the control of a 
particular QTL, but that another subgroup of mice is under the control of the QTL in a 
given segregating population. 

Fig. 62 illustrates a congenic strain that is constructed from two inbred strains, B6
and CAST, where B6 serves as the background strain and CAST serves as the donor
strain.

Fig. 63 illustrates two hypothetical QTL that are linked to a human obesity-related
risk trait. The hypothetical QTL are on human chromosome 8 and are mapped to a
portion of mouse chromosome 15 using a syntenic map between human and mouse in
accordance with one embodiment of the present invention.

Fig. 64 lists four lod score curves for obesity-related traits in mouse. 




Fig. 65 is an overlay of two hypothetical human QTL and two mouse QTL that
shows that the peaks of the two hypothetical human QTL are aligned with the peaks of 
the two mouse QTL. 

Fig. 66 is an illustration showing that the chromosome 13 region in mouse is a
hotspot of activity for eQTL linkage, with hundreds of eQTL linking to one of two BMI QTL
peaks.

Fig. 67 illustrates a method for directly validating genes identified in mice using 
association methods in human populations in accordance with one embodiment of the 
present invention. 

Fig. 68 plots the percentage of eQTL at different lod score thresholds across 920
evenly-spaced bins, each 2 cM wide, covering the mouse genome in a quantitative genetic
analysis performed in accordance with one embodiment of the present invention. 

Fig. 69 illustrates a drug discovery paradigm in accordance with one embodiment
of the present invention. 

Fig. 70 illustrates an in vivo RNAi strategy in accordance with one embodiment of
the present invention.

Fig. 71 illustrates processing steps for subdividing a disease population P into n 
subgroups in accordance with a preferred embodiment of the present invention.

Fig. 72 illustrates a data structure that comprises the data used to identify cellular
constituents that discriminate a trait under study.

Fig. 73 illustrates the classification of a trait of interest into subtraits in
accordance with one embodiment of the present invention. 

Fig. 74 illustrates a topology for how causal genes affect pathways that affect a 
primary disease which, in turn, affects reactive genes. 

Fig. 75A illustrates possible relationships between quantitative trait loci (QTL),
genes and disease traits once the expression of the gene (G) and the disease trait (T) have
been shown to be under the control of a common QTL (Q).

Fig. 75B illustrates obese and lean animals segregating with the genotypes given
at the locus, with up arrows indicating up regulation of the gene, horizontal arrows
indicating no differential regulation, and down arrows indicating down regulation.




Fig. 75C illustrates an analysis of the observed correlation structure between the 
locus, gene expression trait, and obesity trait of Fig. 3B under a causal model. 

Fig. 75D illustrates an analysis of the observed correlation structure between the
locus, gene expression trait, and obesity trait of Fig. 3B under a reactive model.

Fig. 75E illustrates an analysis of the observed correlation structure between the
locus, gene expression trait, and obesity trait of Fig. 3B under an independent model.

Fig. 76 illustrates the genomic positions of the cQTL that are linked to the trait
omental fat pad masses (OFPM) as well as the eQTL that are linked to expression of the
gene HSD1 in a segregating mouse population.

Fig. 77 illustrates a potential relationship between a specific QTL (which controls
for both the trait OFPM and HSD1 expression), HSD1, and OFPM.

Fig. 78 illustrates LOD score curves for HSD1 expression, the trait OFPM, the
simultaneous consideration of HSD1 expression and the trait OFPM, as well as OFPM
after conditioning on HSD1 expression.

Fig. 79 illustrates processing steps for identifying a gene that affects a trait in
accordance with one embodiment of the present invention.

Fig. 80 illustrates the data structure for phenotypic statistic sets in accordance with 
one embodiment of the present invention.

Fig. 81 illustrates a data structure for storing cellular constituent abundance data 
in accordance with one embodiment of the present invention.

Fig. 82 illustrates the data structure for a cellular constituent expression statistic in
accordance with one embodiment of the present invention.

Fig. 83 illustrates a data structure for storing cellular constituent abundance data 
from a plurality of different tissue types in accordance with one embodiment of the 
present invention.

Fig. 84 illustrates a QTL results database in accordance with the present invention.

Figs. 85A-85E illustrate several possible genetic relationships.

Figs. 86A-86D depict cQTL for several clinical traits in the BXD cross (HDL,
plasma insulin levels, plasma leptin levels, and epididymal fat mass) located on murine
chromosome 13.




Fig. 87 highlights a subset of the genes whose expression in the liver of F2
animals derived from a cross of C57BL/6J and DBA/2J mice is controlled by the
chromosome 13 cQTL given in Figure 86.

Fig. 88 is the human amino acid sequence of the cholecystokinin type A receptor
(CCKAR) SEQ ID NO: 30.

Fig. 89 highlights a lod score curve on human chromosome 4 for percent body fat
in females in an Icelandic population. 

Fig. 90 highlights two related human haplotypes that are strongly associated with 
percent body fat in females of an Icelandic population. 

Fig. 91 illustrates a set of haplotypes identified in the CCKAR gene that are
significantly associated with thinness in an Icelandic population.

Like reference numerals refer to corresponding parts throughout the several views 
of the drawings. 

5. DETAILED DESCRIPTION

The present invention provides an apparatus and method for associating a gene
with a trait exhibited by one or more organisms in a plurality of organisms of a single first
species. In some embodiments, experimental data is used from a single second species.
Exemplary single first species and single second species include, but are not limited to,
plants and animals. In typical embodiments, the single first species and the single second
species are different species. In specific embodiments, the exemplary single first species
and the single second species are each drawn from a list of species that includes, but is
not limited to, plants such as corn, beans, rice, tobacco, potatoes, tomatoes, cucumbers,
apple trees, orange trees, cabbage, lettuce, and wheat. In specific embodiments,
exemplary organisms include, but are not limited to, animals such as mammals, primates,
humans, mice, rats, dogs, cats, chickens, horses, cows, pigs, and monkeys. In yet other
specific embodiments, the single first species and/or the single second species are each
selected from the group consisting of Drosophila, yeast, viruses, and Caenorhabditis
elegans (C. elegans).

In some instances, the gene is associated with the trait by identifying a biological
pathway in which the gene product participates. In some embodiments of the present
invention, the trait of interest is a complex trait such as a human disease. Exemplary
human diseases include, but are not limited to, diabetes, obesity, cancer, asthma,
schizophrenia, arthritis, multiple sclerosis, and rheumatosis. In some embodiments, the
trait of interest is a preclinical indicator of disease, such as, but not limited to, high blood
pressure, abnormal triglyceride levels, abnormal cholesterol levels, or abnormal
high-density lipoprotein / low-density lipoprotein levels. In a specific embodiment of the
present invention, the trait is low resistance to an infection by a particular insect or
pathogen. Additional exemplary diseases are found in Section 5.12, infra. In the
invention, the expression level measurement of each gene in each of a plurality of 
organisms is transformed into a corresponding expression statistic. An "expression level
measurement" of a gene can be, for example, a measurement of the level of its encoded
RNA (or cDNA) or proteins or activity levels of encoded proteins. In some
embodiments, this transformation is a normalization routine in which raw gene expression
data is normalized to yield a mean log ratio, a log intensity, and a background-corrected
intensity. Further, a genetic marker map 78 (Fig. 1) is constructed from a set of genetic
markers associated with the plurality of organisms. Then, for each gene G in a plurality
of genes expressed by an organism in the population, a quantitative trait locus (QTL)
analysis is performed using the genetic marker map in order to produce QTL data. A set
of expression statistics represents the quantitative trait used in each QTL analysis. QTL
analyses are explained in greater detail, infra, in conjunction with Fig. 2, element 210.
This set of expression statistics, for any given gene G, comprises an expression statistic
for gene G, for each organism in the plurality of organisms. Next, the QTL data obtained
from each QTL analysis is clustered to form a QTL interaction map. Identification of
tightly clustered QTLs in the QTL interaction map helps to identify genes that are
genetically interacting. This information, in turn, helps to elucidate biological pathways
that are affected by complex traits, such as human disease. In some embodiments of the
present invention, tightly clustered QTLs in the QTL interaction map are considered
candidate pathway groups. These candidate pathway groups are subjected to multivariate
analysis in order to verify whether the genes in the candidate pathway group affect a
particular complex trait.

One embodiment of the present invention provides a method for associating a
gene with a trait exhibited by one or more organisms in a plurality of organisms of a
single species. In the method, quantitative trait locus data from a plurality of quantitative
trait locus analyses are clustered to form a quantitative trait locus interaction map. Each
quantitative trait locus analysis in the plurality of quantitative trait locus analyses is
performed for a gene G in a plurality of genes in the genome of the plurality of organisms
using a genetic marker map and a quantitative trait in order to produce the quantitative
trait locus data. For each quantitative trait locus analysis, the quantitative trait comprises
an expression statistic for the gene G for which the quantitative trait locus analysis has
been performed, for each organism in the plurality of organisms. The genetic marker map
is constructed from a set of genetic markers associated with the plurality of organisms.
Further, in the method, the quantitative trait locus interaction map is analyzed to identify
a gene associated with a trait, thereby associating the gene with the trait exhibited by one
or more organisms in the plurality of organisms.



5.1. OVERVIEW OF THE INVENTION 
Fig. 1 illustrates a system 10 that is operated in accordance with one embodiment
of the present invention. In addition, Fig. 2 illustrates the processing steps that are
performed in accordance with one embodiment of the present invention. These figures
will be referenced in this section in order to disclose the advantages and features of the
present invention. System 10 comprises at least one computer 20 (Fig. 1). Computer 20
comprises standard components including a central processing unit 22, memory 24
(including high speed random access memory as well as non-volatile storage, such as disk
storage) for storing program modules and data structures, user input/output device 26, a
network interface 28 for coupling server 20 to other computers via a communication
network (not shown), and one or more busses 34 that interconnect these components.
User input/output device 26 comprises one or more user input/output components such as
a mouse 36, display 38, and keyboard 8.

Memory 24 comprises a number of modules and data structures that are used in
accordance with the present invention. It will be appreciated that, at any one time during
operation of the system, a portion of the modules and/or data structures stored in memory
24 is stored in random access memory while another portion of the modules and/or data
structures is stored in non-volatile storage. In a typical embodiment, memory 24
comprises an operating system 40. Operating system 40 comprises procedures for
handling various basic system services and for performing hardware dependent tasks.
Memory 24 further comprises a file system 42 for file management. In some
embodiments, file system 42 is a component of operating system 40.




The present invention begins with gene expression data 44 (e.g., from a gene
expression study or a proteomics study) and genotype and/or pedigree data 68 from an
experimental cross or human cohort under study (Fig. 1; Fig. 2, step 202). In one
embodiment, gene expression data 44 consists of the processed microarray images for
each individual (organism) 46 in a population under study. In some embodiments, such
data comprises, for each individual 46, intensity information 50 for each gene 48
represented on the array, background signal information 52, and associated annotation
information 54 describing the gene probe (Fig. 1). In some embodiments, gene
expression data is, in fact, protein levels for various proteins in the organisms 46 under
study. In one aspect of the present invention, the expression level of a gene in an
organism in the population of interest is determined by measuring an amount of at least
one cellular constituent that corresponds to the gene in one or more cells of the organism.
As used herein, the term "cellular constituent" comprises individual genes, proteins,
mRNA expressing a gene, and/or any other variable cellular component or protein
activity, degree of protein modification (e.g., phosphorylation), for example, that is
typically measured in a biological experiment by those skilled in the art. Although for
simplicity the disclosure often makes reference to single cells, it will be understood by
those of skill in the art that more often any particular step of the invention will be carried
out using a plurality of genetically similar cells, e.g., from a cultured cell line. Such
similar cells are referred to herein as a "cell type." In one embodiment, the amount of the
at least one cellular constituent that is measured comprises abundances of at least one
RNA species present in one or more cells. Such abundances may be measured by a
method comprising contacting a gene transcript array with RNA from one or more cells
of the organism, or with cDNA derived therefrom. A gene transcript array comprises a
surface with attached nucleic acids or nucleic acid mimics. The nucleic acids or nucleic
acid mimics are capable of hybridizing with the RNA species or with cDNA derived from
the RNA species. In some embodiments, gene expression data 44 is taken from tissues
that have been associated with the complex trait under study. For example, in one
nonlimiting embodiment where the complex trait under study is human obesity, gene
expression data is taken from the liver, brain, or adipose tissues.

In some embodiments of the present invention, gene expression / cellular
constituent data 44 is measured from multiple tissues of each organism 46 (Fig. 1) under
study. For example, in some embodiments, gene expression / cellular constituent data 44
is collected from one or more tissues selected from the group of liver, brain, heart,
skeletal muscle, white adipose from one or more locations, and blood. In such
embodiments, the data is stored in an exemplary data structure such as that disclosed in
Fig. 3C. This data structure is described in more detail below.

Genotype and/or pedigree data 68 (Fig. 1) comprise the actual alleles for each
genetic marker typed in each individual under study, in addition to the relationships
between these individuals. The extent of the relationships between the individuals under
study may be as simple as an F2 population or as complicated as extended human family
pedigrees. Exemplary sources of genotype and pedigree data are described in Section 6.1,
infra. In some embodiments of the present invention, pedigree data is optional.

Marker data 70 at regular intervals across the genome under study or in gene
regions of interest is used to monitor segregation or detect associations in a population of
interest. Marker data 70 comprises those markers that will be used in the population
under study to assess genotypes. In one embodiment, marker data 70 comprises the
names of the markers, the type of markers, and the physical and genetic location of the
markers in the genomic sequence. Exemplary types of markers include, but are not
limited to, restriction fragment length polymorphisms "RFLPs", random amplified
polymorphic DNA "RAPDs", amplified fragment length polymorphisms "AFLPs",
simple sequence repeats "SSRs", single nucleotide polymorphisms "SNPs",
microsatellites, etc. Further, in some embodiments, marker data 70 comprises the
different alleles associated with each marker. For example, a particular microsatellite
marker consisting of 'CA' repeats may have ten different alleles represented in the
population under study, with each of the ten different alleles in turn consisting of some
number of repeats. Representative marker data 70 in accordance with one embodiment of
the present invention is found in Section 5.2, infra. In one embodiment of the present
invention, the genetic markers used comprise single nucleotide polymorphisms (SNPs),
microsatellite markers, restriction fragment length polymorphisms, short tandem repeats,
DNA methylation markers, and/or sequence length polymorphisms.
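
By way of nonlimiting illustration only, marker data 70 could be held in a record such as the one sketched below in Python. The class name Marker, its field names, and the marker name "D1Hyp1" are hypothetical and do not appear in the disclosure; the sketch merely mirrors the items listed above (marker name, marker type, physical and genetic location, and associated alleles).

```python
from dataclasses import dataclass, field


@dataclass
class Marker:
    """One entry of marker data (all field names are illustrative assumptions)."""
    name: str                 # marker name
    marker_type: str          # e.g. "SNP", "SSR", "RFLP"
    chromosome: str           # chromosome carrying the marker
    physical_bp: int          # physical location, in base pairs
    genetic_cm: float         # genetic location, in centiMorgans
    alleles: list = field(default_factory=list)  # alleles seen in the population


# A hypothetical 'CA' repeat microsatellite with ten alleles,
# each allele being a different number of repeats.
ca_repeat = Marker(
    name="D1Hyp1",
    marker_type="SSR",
    chromosome="1",
    physical_bp=12_345_678,
    genetic_cm=8.5,
    alleles=[f"(CA){n}" for n in range(10, 20)],
)
```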

Once starting data are assembled, the first step (Fig. 2, step 204) is to transform
gene expression data 44 into expression statistics that are used to treat each cellular
constituent abundance in gene expression data 44 as a quantitative trait. In some
embodiments, gene expression data 44 (Fig. 1) comprises gene expression data for a
plurality of genes or cellular constituents that correspond to the plurality of genes. In one
embodiment, the plurality of genes comprises at least five genes. In another embodiment,
the plurality of genes comprises at least one hundred genes, at least one thousand genes,
at least twenty thousand genes, or more than thirty thousand genes. The expression
statistics commonly used as quantitative traits in the analyses in one embodiment of the
present invention include, but are not limited to, the mean log ratio, log intensity, and
background-corrected intensity. In other embodiments, other types of expression
statistics are used as quantitative traits. In one embodiment, this transformation (Fig. 2,
step 204) is performed using normalization module 72 (Fig. 1). In such embodiments, the
expression level of a plurality of genes in each organism under study is normalized.
Any normalization routine may be used by normalization module 72. Representative
normalization routines include, but are not limited to, Z-score of intensity, median
intensity, log median intensity, Z-score standard deviation log of intensity, Z-score mean
absolute deviation of log intensity, calibration DNA gene set, user normalization gene set,
ratio median intensity correction, and intensity background correction. Furthermore,
combinations of normalization routines may be run. Exemplary normalization routines in
accordance with the present invention are disclosed in more detail in Section 5.3, infra.
The expression statistics formed from the transformation are then stored in expression /
genotype warehouse 76, where they are ultimately matched with the corresponding
genotype information.
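
By way of nonlimiting illustration only, one possible combination of the routines named above (intensity background correction followed by a Z-score of log intensity) is sketched below in Python. The function name, the base-10 logarithm, the use of the sample standard deviation, and the floor of 1.0 applied before taking logarithms are assumptions made for the sketch; they are not the routines of Section 5.3.

```python
import math
from statistics import mean, stdev


def zscore_log_intensity(intensities, background=0.0):
    """Background-correct, log10-transform, and Z-score one array's intensities.

    Assumed sketch: a single scalar background is subtracted, values are
    floored at 1.0 so the logarithm is defined, and the Z-score uses the
    sample standard deviation of the log intensities.
    """
    corrected = [max(x - background, 1.0) for x in intensities]
    logs = [math.log10(x) for x in corrected]
    mu = mean(logs)
    sigma = stdev(logs)
    return [(v - mu) / sigma for v in logs]


# Toy intensities for three genes measured in one organism.
raw = [100.0, 1000.0, 10000.0]
stats = zscore_log_intensity(raw)
```

Each returned value would serve as the expression statistic for one gene in one organism.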

In addition to the generation of expression statistics from gene expression data 44,
a genetic marker map 78 is generated from genetic markers 70 (Fig. 1; Fig. 2, step 206).
In one embodiment of the present invention, a genetic marker map is created using
marker map construction module 74 (Fig. 1). Further, in one embodiment, genotype
probability distributions for the individuals under study are computed. Genotype
probability distributions take into account information such as marker information of
parents, known genetic distances between markers, and estimated genetic distances
between the markers. Computation of genotype probability distributions generally
requires pedigree data. In some embodiments of the present invention, pedigree data is
not provided and genotype probability distributions are not computed.

Once the expression data has been transformed into corresponding expression
statistics and genetic marker map 78 has been constructed, the data is transformed into a
structure that associates all marker, genotype and expression data for input into QTL
analysis software. This structure is stored in expression / genotype warehouse 76 (Fig. 1;
Fig. 2, step 208).




A quantitative trait locus (QTL) analysis is performed using data corresponding to
each gene in a plurality of genes as a quantitative trait (Fig. 2, step 210). For 20,000
genes, this results in 20,000 separate QTL analyses. For embodiments in which multiple
tissue samples are collected for each organism, this results in even more separate QTL
analyses. For example, in embodiments in which samples are collected from two different
tissues, an analysis of 20,000 genes requires 40,000 separate QTL analyses. In one
embodiment, each QTL analysis is performed by genetic analysis module 80 (Fig. 1). In
one example, each QTL analysis steps through each chromosome in the genome of the
organism of interest. Linkages to the gene under consideration are tested at each step or
location along the length of the chromosome. In such embodiments, each step or location
along the length of the chromosome is at regularly defined intervals. In some
embodiments, these regularly defined intervals are defined in Morgans or, more typically,
centiMorgans (cM). A Morgan is a unit that expresses the genetic distance between
markers on a chromosome. A Morgan is defined as the distance on a chromosome in
which one recombinational event is expected to occur per gamete per generation. In
some embodiments, each regularly defined interval is less than 100 cM. In other
embodiments, each regularly defined interval is less than 10 cM, less than 5 cM, or less
than 2.5 cM.

In each QTL analysis, data corresponding to a gene selected from a plurality of
genes under study is used as a quantitative trait. More specifically, for any given gene,
the quantitative trait used in the QTL analysis is an expression statistic set such as set 304
(Fig. 3A). Expression statistic set 304 comprises the corresponding expression statistic
308 for the gene 302 from each organism 306 in the population under study. Fig. 3B
illustrates an exemplary expression statistic set 304 in accordance with one embodiment
of the present invention. Exemplary expression statistic set 304 includes the expression
level 308 of a gene G (or cellular constituent that corresponds to gene G) from each
organism in a plurality of organisms. For example, consider the case where there are ten
organisms in the plurality of organisms, and each of the ten organisms expresses gene G.
In this case, expression statistic set 304 includes ten entries, each entry corresponding to a
different one of the ten organisms in the plurality of organisms. Further, each entry
represents the expression level of gene G in the organism represented by the entry. So,
entry "1" (308-G-1) corresponds to the expression level of gene G in organism 1, entry
"2" (308-G-2) corresponds to the expression level of gene G in organism 2, and so forth.
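
By way of nonlimiting illustration only, the ten-organism example above can be sketched in Python as a mapping from each gene to its per-organism expression statistics. The function name and the nested-dictionary layout are assumptions made for the sketch and are not the data structure of Figs. 3A-3C.

```python
def build_expression_statistic_sets(measurements):
    """Regroup {organism: {gene: statistic}} into {gene: {organism: statistic}}.

    Each inner dictionary of the result plays the role of one expression
    statistic set: one entry per organism for a single gene.
    """
    sets = {}
    for organism, by_gene in measurements.items():
        for gene, value in by_gene.items():
            sets.setdefault(gene, {})[organism] = value
    return sets


# Ten organisms (numbered 1 through 10), each expressing gene "G".
measurements = {organism: {"G": 0.1 * organism} for organism in range(1, 11)}
statistic_sets = build_expression_statistic_sets(measurements)
```

Here statistic_sets["G"][1] corresponds to entry "1" for gene G, statistic_sets["G"][2] to entry "2", and so forth.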




Referring to Fig. 3C, in some embodiments of the present invention, expression
data from multiple tissue samples of each organism 306 (Fig. 1, 46) under study are
collected. When this is the case, the data can be stored in the exemplary data structure
illustrated in Fig. 3C. In Fig. 3C, a plurality of genes 302 are represented. Further, there
is an expression statistic set 304 for each gene 302. Each expression statistic set 304
represents the expression level (308) of the gene or an abundance of a cellular constituent
(308) that corresponds to the gene in each of a plurality of organisms 306 (Fig. 1, 46). In
one example, a cellular constituent is a particular protein and the cellular constituent
corresponds to a gene when the gene codes for the cellular constituent.

In one embodiment of the present invention, each QTL analysis (Fig. 2, step 210)
comprises: (i) testing for linkage between a position in a chromosome and the quantitative
trait used in the quantitative trait locus (QTL) analysis, (ii) advancing the position in the
chromosome by an amount, and (iii) repeating steps (i) and (ii) until all or a portion of the
genome has been tested. In typical embodiments, the quantitative trait is the expression
statistic set 304, such as the set illustrated in Fig. 3B. In some embodiments, testing for
linkage between a given position in the chromosome and the expression statistic set 304
comprises correlating differences in the expression levels found in the expression level
statistic with differences in the genotype at the given position using single marker tests
(for example using t-tests, analysis of variance, or simple linear regression statistics).
See, e.g., Statistical Methods, Snedecor and Cochran, 1985, Iowa State University Press,
Ames, Iowa. However, there are many other methods for testing for linkage between
expression statistic set 304 and a given position in the chromosome. In particular, if
expression statistic set 304 is treated as the phenotype (in this case, a quantitative
phenotype), then methods such as those disclosed in Doerge, 2002, Mapping and analysis
of quantitative trait loci in experimental populations, Nature Reviews: Genetics 3:43-62,
may be used. Concerning steps (i) through (iii) above, if the genetic length of a given
chromosome is N cM and 1 cM steps are used, then N different tests for linkage are
performed on the given chromosome. For organisms having multiple chromosomes, this
process is repeated for each chromosome in the genome.
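
By way of nonlimiting illustration only, steps (i) through (iii) can be sketched in Python as a scan along one chromosome. The sketch assumes a simple linear regression statistic (the squared Pearson correlation between genotype codes and the expression statistic set) as the single marker test; the function names and the callable genotypes_at, which is taken to return one numeric genotype code per organism at a given position, are hypothetical.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation, i.e. R^2 of a simple linear regression."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # no variation, so no evidence of association
    return (sxy * sxy) / (sxx * syy)


def scan_chromosome(genotypes_at, length_cm, trait, step_cm=1.0):
    """Test each position, advance by step_cm, repeat to the chromosome end."""
    results = []
    position = 0.0
    while position < length_cm:
        results.append((position, r_squared(genotypes_at(position), trait)))  # step (i)
        position += step_cm                                                   # step (ii)
    return results                                                            # step (iii) is the loop


def genotypes_at(position):
    """Toy genotype codes (0, 1, 2) for six organisms at a position."""
    return [0, 0, 1, 1, 2, 2] if position < 5 else [0, 2, 1, 1, 2, 0]


trait = [1.0, 1.2, 2.0, 2.1, 3.0, 3.1]  # expression statistics for one gene
scan = scan_chromosome(genotypes_at, 10, trait)
```

With a 10 cM chromosome and 1 cM steps, the scan performs ten tests, consistent with the N tests noted above.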

In some embodiments, the QTL data produced from each respective QTL analysis
comprises a logarithm of the odds score (lod) computed at each position tested in the
genome under study. A lod score is a statistical estimate of whether two loci are likely to
lie near each other on a chromosome and are therefore likely to be genetically linked. In
the present case, a lod score is a statistical estimate of whether a given position in the
genome under study is linked to the quantitative trait corresponding to a given gene. Lod
scores are further defined in Section 5.4, infra. Generally, a lod score of three or more
suggests that two loci are genetically linked, a lod score of 4 or more is strong evidence
that two loci are genetically linked, and a lod score of 5 or more is very strong evidence
that two loci are genetically linked. However, the significance of any given lod score
actually varies from species to species depending on the model used. The generation of
lod scores requires pedigree data. Accordingly, in embodiments in which a lod score is
generated, processing step 210 is essentially a linkage analysis, as described in Section
5.13, with the exception that the quantitative trait under study is derived from data, such
as cellular constituent expression statistics, rather than classical phenotypes such as eye
color.
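
By way of nonlimiting illustration only, the Python sketch below shows one common way a lod score may be obtained for a quantitative trait under a regression model with normally distributed errors, namely (n/2) log10(RSS0/RSS1), where RSS0 and RSS1 are the residual sums of squares without and with the putative QTL. This formula is an assumption made for the sketch and is not the definition given in Section 5.4; the function names are hypothetical. The second function converts a lod score back to an odds ratio, so that a lod score of three corresponds to odds of 1000 to 1.

```python
import math


def lod_from_rss(n, rss_null, rss_qtl):
    """Lod score under an assumed normal-error regression model.

    n        -- number of organisms
    rss_null -- residual sum of squares with no QTL in the model
    rss_qtl  -- residual sum of squares with the putative QTL in the model
    """
    if n <= 0 or rss_null <= 0 or rss_qtl <= 0:
        raise ValueError("n and both residual sums of squares must be positive")
    return (n / 2.0) * math.log10(rss_null / rss_qtl)


def odds_from_lod(lod):
    """Odds in favor of linkage implied by a lod score (base-10 logarithm)."""
    return 10.0 ** lod


# Toy values: the QTL model halves the residual sum of squares in 100 organisms.
example_lod = lod_from_rss(100, 100.0, 50.0)
```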

In situations where pedigree data is not available, genotype data from each of the
organisms 46 (Fig. 1) for each marker in genetic marker map 78 may be compared to
each quantitative trait (expression statistic set 304) using allelic association analysis, as
described in Section 5.14, infra, in order to identify QTL that are linked to each
expression statistic set 304. In one form of association analysis, an affected population is
compared to a control population. In particular, haplotype or allelic frequencies in the
affected population are compared to haplotype or allelic frequencies in a control
population in order to determine whether particular haplotypes or alleles occur at
significantly higher frequency amongst affected compared with control samples.
Statistical tests such as a chi-square test are used to determine whether there are
differences in allele or genotype distributions.
differences in allele or genotype distributions. 

Regardless of whether linkage analysis or association analysis is used in step 210,
the results of each QTL analysis are stored in QTL results database 82 (Fig. 1; Fig. 2, step
212). For each quantitative trait 84 (expression statistic set 304), QTL results database 82
comprises all positions 86 in the genome of the organism that were tested for linkage to
the quantitative trait 84. Positions 86 are obtained from genetic marker map 70. Further,
for each position 86, genotype data 68 provides the genotype at position 86, for each
organism in the plurality of organisms under study. For each such position 86 analyzed
by QTL analysis, a statistical measure (e.g., statistical score 88), such as the maximum
lod score between the position and the quantitative trait 84, is listed. There is a lod score
for the entire population tested as well as individual lod scores for each of the individuals
under study. Thus, data structure 82 comprises all the positions in the genome of the
organism of interest that are genetically linked to each quantitative trait 84 tested.

45 



wo 2004/061616 PCT/US2003/041613 

Fig. 4 provides a more detailed illustration of QTL results database 82. Each
statistical score 88 (e.g., lod score) measures the degree to which a given position 86 is
linked to the corresponding quantitative trait 84 (e.g., expression statistic set 304). The
set of statistical scores 88 for any given quantitative trait 84 may be considered (may be
viewed as) a QTL vector. Thus, in some embodiments of the present invention, a QTL
vector is created for each gene tested in the chromosome of the organism studied. Each
element of the QTL vector is a statistical score (e.g., lod score) at a different position in
the genome of the species under study. In some embodiments in which gene expression /
cellular constituent data 44 is collected from multiple tissue samples in each organism
under study, a separate QTL vector is created for each tissue type from which data 44 was
collected. For example, consider the case in which data 44 (Fig. 1) is collected from two
different tissue types from each organism 46 under study. In such embodiments, two
QTL vectors are created for each cellular constituent (e.g., gene, protein) 48 tested. The
first QTL vector for a given gene / cellular constituent 48 corresponds to one tissue type
sampled and the second QTL vector for the given gene / cellular constituent 48
corresponds to the second tissue type sampled. Thus, in effect, in some embodiments in
which data from multiple tissues is collected, the data from each tissue type is treated for
purposes of processing steps 202 through 220 as if the data were collected from
independent organisms. However, in step 222, the data from multiple tissue types is
optionally compared in order to determine the effect that tissue type has on the linkage
analysis. Methods that incorporate data from multiple tissue types are described in
more detail in conjunction with step 222 below as well as Section 5.6, below.

In some embodiments, a QTL vector is created for each gene tested in the entire
genome of the organism studied. The QTL vector comprises the statistical score at each
position tested by the quantitative trait locus (QTL) analysis corresponding to the gene.
In addition to QTL vectors, gene expression vectors may be constructed from transformed
gene expression data 44. Each gene expression vector represents the transformed
expression level of the gene from each organism in the population of interest. Thus, any
given gene expression vector comprises the transformed expression level of the gene from
a plurality of different organisms in the population of interest.
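
By way of illustration only, the construction of QTL vectors described above can be
sketched as follows. The function name build_qtl_vectors, the dictionary layout, and the
position labels are hypothetical and are not taken from the present disclosure; the sketch
simply groups lod scores into one vector per gene and tissue type, ordered by tested
position.

```python
def build_qtl_vectors(lod_scores, positions):
    """Group lod scores into one QTL vector per (gene, tissue) pair.

    lod_scores: dict mapping (gene, tissue, position) to a lod score
                (hypothetical layout, chosen only for this sketch).
    positions:  ordered list of the genome positions that were tested.
    Returns a dict mapping (gene, tissue) to a list of lod scores ordered
    like `positions`; positions with no score are left as None.
    """
    index = {pos: k for k, pos in enumerate(positions)}
    vectors = {}
    for (gene, tissue, pos), lod in lod_scores.items():
        vector = vectors.setdefault((gene, tissue), [None] * len(positions))
        vector[index[pos]] = lod
    return vectors


# Made-up example: one gene measured in two tissue types at three positions.
positions = ["chr1:10cM", "chr1:20cM", "chr2:5cM"]
lod_scores = {
    ("geneA", "liver", "chr1:10cM"): 0.4,
    ("geneA", "liver", "chr1:20cM"): 3.6,
    ("geneA", "liver", "chr2:5cM"): 0.1,
    ("geneA", "brain", "chr1:10cM"): 0.2,
    ("geneA", "brain", "chr1:20cM"): 1.1,
    ("geneA", "brain", "chr2:5cM"): 2.9,
}
qtl_vectors = build_qtl_vectors(lod_scores, positions)
```

In this sketch each tissue type yields its own QTL vector for the same gene, which mirrors
the treatment of each tissue type as if it were collected from an independent organism.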

With the QTL vectors generated, the next step of the present invention involves
the generation of QTL interaction maps from the QTL vectors (Fig. 2, step 214). To
generate QTL interaction maps, the QTL vectors are clustered into groups of QTLs based
on the strength of interaction between the QTL vectors. In some embodiments of the
present invention, QTL interaction maps are generated by clustering module 92. In
embodiments in which QTL vectors are generated from several different tissue types, the
QTL vectors from the various tissue types are clustered since gene expression in one
tissue may drive expression in another tissue. In some embodiments, QTL representing
diverse tissue types are clustered. In one embodiment of the present invention,
agglomerative hierarchical clustering is applied to the QTL vectors. In this clustering,
similarity is determined using Pearson correlation coefficients between QTL vector
pairs. In other embodiments, the clustering of the QTL data from each QTL analysis
comprises application of a hierarchical clustering technique, application of a k-means
technique, application of a fuzzy k-means technique, application of a Jarvis-Patrick
clustering technique, application of a self-organizing map or application of a neural
network. In some embodiments, the hierarchical clustering technique is an agglomerative
clustering procedure. In other embodiments, the agglomerative clustering procedure is a
nearest-neighbor algorithm, a farthest-neighbor algorithm, an average linkage algorithm,
a centroid algorithm, or a sum-of-squares algorithm. In still other embodiments, the
hierarchical clustering technique is a divisive clustering procedure. Illustrative clustering
techniques that may be used to cluster QTL vectors are described in Section 5.5, infra.
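
A minimal sketch of one such clustering, assuming an average linkage agglomerative
procedure with Pearson correlation as the similarity measure, is given below. The function
names, the stopping threshold min_similarity, and the example lod scores are hypothetical
and serve only to illustrate the idea; they are not the implementation of clustering
module 92.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors.

    Assumes neither vector is constant (otherwise the denominator is zero).
    """
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def average_linkage(vectors, min_similarity):
    """Agglomerative clustering: repeatedly merge the two most similar clusters.

    vectors: dict mapping a gene name to its QTL vector (list of lod scores).
    Merging stops when the best average pairwise correlation between two
    clusters falls below min_similarity (an assumed stopping rule).
    Returns a sorted list of clusters, each a sorted list of gene names.
    """
    clusters = [[name] for name in vectors]

    def similarity(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(pearson(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        score, i, j = max(
            ((similarity(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda t: t[0],
        )
        if score < min_similarity:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Made-up QTL vectors: geneA/geneB peak at the same positions, as do geneC/geneD.
example_vectors = {
    "geneA": [0.5, 3.2, 4.1, 0.3],
    "geneB": [0.4, 3.0, 4.4, 0.2],
    "geneC": [4.0, 0.2, 0.1, 3.5],
    "geneD": [3.8, 0.3, 0.4, 3.9],
}
qtl_clusters = average_linkage(example_vectors, min_similarity=0.8)
```

With these made-up vectors the sketch groups geneA with geneB and geneC with geneD,
because each pair shares its lod score peaks while the two pairs are negatively correlated.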

Since each QTL corresponds to a given gene in a plurality of genes in the
population of interest, QTL interaction maps provide information on which QTLs are
linked. Such information may be combined with gene expression data to help elucidate
biological pathways that affect complex traits. In one embodiment of the present
invention, a gene expression cluster map is constructed from gene expression statistics
(Fig. 2, step 216). A plurality of gene expression vectors are created. Each gene
expression vector in the plurality of gene expression vectors represents the expression
level, activity, or degree of modification of a particular cellular constituent, such as a
gene or gene product, in a plurality of cellular constituents in the population of interest.
Then a plurality of correlation coefficients is computed. Each correlation coefficient in
the plurality of correlation coefficients is computed between a gene expression vector pair
in the plurality of gene expression vectors. Then the plurality of gene expression vectors
are clustered based on the plurality of correlation coefficients in order to form the gene
expression cluster map. In one embodiment of the present invention, each correlation
coefficient in the plurality of correlation coefficients is a Pearson correlation coefficient.
In another embodiment of the present invention, clustering of the plurality of gene
expression vectors comprises application of a hierarchical clustering technique,
application of a k-means technique, application of a fuzzy k-means technique, application
of a self-organizing map or application of a neural network. In one embodiment of the
present invention, the hierarchical clustering technique is an agglomerative clustering
procedure such as a nearest-neighbor algorithm, a farthest-neighbor algorithm, an average
linkage algorithm, a centroid algorithm, or a sum-of-squares algorithm. In another
embodiment of the invention, the hierarchical clustering technique is a divisive clustering
procedure. Illustrative clustering techniques that may be used to cluster the gene
expression vectors are described in Section 5.5, infra.

At this stage, the QTL interaction map provides information on individual genes
in gene expression clusters found in gene expression cluster maps. Gene expression
clusters found in gene expression cluster maps may be considered to be in the same
candidate pathway group. QTL interactions can be used to identify those genes that are
"closer" together in a candidate pathway group than other genes. Furthermore, genes in
gene expression clusters found in a gene expression map that are not at all genetically
interacting may be down-weighted with respect to those genes that are genetically
interacting. In this way, QTL interaction maps help to refine candidate pathway groups
that are identified in gene expression cluster maps. However, the QTL interaction map
does not provide the actual topology of the pathway. An illustrative topology of a
biological pathway may be, for example, that gene A is upstream of gene B. Another
drawback of the QTL interaction map is that the map may include false positives. For
example, a cluster within the QTL interaction map may include genes that do not interact
genetically. To shed light on the topology of biological pathways associated with
complex diseases, as well as to eliminate false positive genes, processing steps 216
through 222 are performed, as described in detail below.

In one embodiment of the present invention, the next step involves mapping all
probes used to generate gene expression data 44 (Fig. 1) to their respective genomic and
genetic coordinates. This information aids in establishing the potential for a given gene
to correspond directly to a particular QTL (i.e., that a gene actually was the QTL).

In one embodiment of the present invention, clusters of QTL interactions from the
QTL interaction maps and clusters of gene expression interactions from the gene
expression cluster maps are represented in cluster database 94 (Fig. 1; Fig. 2, step 218).
Cluster database 94 is used to identify the patterns that feed a multivariate QTL analysis.
In addition to the QTL and gene expression cluster information, the physical locations of
the QTLs and genes are represented in cluster database 94.

In some embodiments of the present invention, a gene is identified in the QTL
interaction map by filtering the QTL interaction map in order to obtain a candidate
pathway group. In one embodiment, this filtering comprises selecting those QTL for the
candidate pathway group that interact most strongly with another QTL in the QTL
interaction map. In some embodiments, the QTL that interact most strongly with another
QTL in the QTL interaction map are all QTL, represented in the QTL interaction map,
that share a correlation coefficient with another QTL in the QTL interaction map that is
higher than 75%, 85%, or 95% of all correlation coefficients computed between QTLs in
the QTL interaction map.
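
The percentile filter described above might be sketched as follows. This is only an
illustration under stated assumptions: the function name strongly_interacting, the
pair-keyed dictionary, and the reading of "higher than 75%" as "strictly greater than at
least 75% of all computed coefficients" are choices made for this sketch.

```python
from bisect import bisect_left


def strongly_interacting(correlations, fraction=0.75):
    """Select QTL that share a high-ranking correlation with another QTL.

    correlations: dict mapping a pair (qtl_a, qtl_b) to the correlation
                  coefficient computed between the two QTL vectors.
    fraction:     a QTL is kept if one of its coefficients is strictly
                  higher than at least this fraction of all coefficients
                  (an assumed interpretation of the 75%/85%/95% cutoff).
    Returns the set of selected QTL names.
    """
    ranked = sorted(correlations.values())
    selected = set()
    for (qtl_a, qtl_b), r in correlations.items():
        below = bisect_left(ranked, r)  # coefficients strictly less than r
        if below >= fraction * len(ranked):
            selected.update((qtl_a, qtl_b))
    return selected


# Made-up pairwise correlations between four QTL.
example_correlations = {
    ("q1", "q2"): 0.95,
    ("q1", "q3"): 0.10,
    ("q1", "q4"): 0.20,
    ("q2", "q3"): 0.15,
    ("q2", "q4"): 0.30,
    ("q3", "q4"): 0.40,
}
candidate_group = strongly_interacting(example_correlations, fraction=0.75)
```

Raising the fraction to 0.85 or 0.95 shrinks the candidate pathway group, which is one way
the size of such groups can be tuned.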

In one embodiment of the present invention, cluster database 94 is used to
associate a gene with a trait. Typically, the trait of interest is a complex trait.
Representative traits include, but are not limited to, disease status, tumor stage,
triglyceride levels, blood pressure, and/or diagnostic test results. In this embodiment, the
QTL interaction map and/or data stored in cluster database 94 is filtered in order to obtain
a candidate pathway group (Fig. 2, step 220). This filtering comprises identifying a QTL
in the candidate pathway group in the gene expression cluster map. In one example in
accordance with this embodiment of the present invention, the QTL interaction map is
filtered by identifying groups of QTL within the QTL interaction map that interact closely
with one another. The genes associated with each QTL in the groups of QTL that interact
closely with one another in a QTL interaction map are considered candidate pathway
groups. In some embodiments, the filtering further comprises looking up the genes in
each of the candidate pathway groups in the gene expression interaction map. Of interest
is whether the genes in the candidate pathway groups identified in the QTL interaction
map interact closely with each other in the gene expression interaction map. In some
embodiments, the topology of pathway groups (e.g., biological pathways) can be
determined by identifying genes that colocalize with one of their QTL, as described in
Section 6.7.1, infra.

In general, patterns of interest may be identified by querying cluster database 94.
Such groups may be identified by filtering on strength of QTL-QTL interactions, which
identifies those genes that are most strongly genetically interacting, and then combining
this information with genes that are the most tightly clustered within these groups. The
size of these groups is easily adjusted by scaling the threshold parameters used to identify
QTL and/or genes that are interacting. Such groups could themselves be considered
putative pathway groups. However, another approach is to fit the groups to genetic
models in order to test whether the genes are actually part of the same pathway.

In one embodiment in accordance with the present invention, the degree to which
each QTL making up a candidate pathway group belongs with other QTLs within the
candidate pathway group is tested by fitting a multivariate statistical model to the
candidate pathway group (Fig. 2; step 222). Multivariate statistical models have the
capability of simultaneously considering multiple quantitative traits, modeling epistatic
interactions between the QTL and testing other interesting variations that test whether
genes in a candidate pathway group belong to the same or related biological pathway.
Specific tests can be done to determine if the traits under consideration are actually
controlled by the same QTL (pleiotropic effects) or if they are independent.

Importantly, multivariate statistical analysis can be used to consider multiple
traits simultaneously. This is of use to determine whether the traits are
genetically linked to each other. Accordingly, in such embodiments, a cluster of QTL
found in the QTL interaction map produced in step 214 and verified using the gene
expression cluster map produced in step 216 can be subjected to multivariate statistical
analysis in order to determine whether the QTL are all genetically linked. Such an
analysis may determine that some of the QTL in the cluster found in the QTL interaction
map are, in fact, linked whereas other QTL in the cluster are not linked.

Multivariate statistical analysis can also be used to study the same trait from
multiple tissues. Multivariate statistical analysis of the same trait from multiple tissues
can be used to determine whether genetic linkage varies on a tissue specific basis. Such
techniques are of use, for example, in instances where a complex disease has a tissue
specific etiology. In some instances, multivariate analysis can be used to simultaneously
consider multiple traits from multiple tissues. Exemplary multivariate statistical models
that may be used in accordance with the present invention are found in Section 5.6, infra.

The results of the multivariate QTL analysis are used to "validate" the candidate
pathway groups. These validated groups are then represented in a database and made
available for the final stage of analysis, which involves reconstructing the pathway. At
this stage the database comprises genes that are under some kind of common genetic
control, interact to some degree at the expression level, and that have been shown to be
strongly enough interacting at these different levels to perhaps belong to the same or
related pathways. Thus, in some instances, the association of a gene with a trait exhibited
by one or more organisms in a population of interest results in the placement of the gene
in a pathway group that comprises genes that are part of the same or related pathway.

The final step involves an attempt to partially reconstruct the pathways within a
given pathway group. For each candidate pathway group, the interactions between the
representative QTL vectors and gene expression vectors can be examined. Furthermore,
QTL and probe location information can be used to begin to piece together causal
pathways. In addition, graphical models can be fit to the data using the interaction
strengths, QTL overlap and physical location information accumulated from the previous
steps to weight and direct the edges that link genes in a candidate pathway group.
Application of such graphical models is used to determine which genes are more closely
linked in a candidate pathway group and therefore suggests models for constraining the
topology of the pathway. Thus, such models test whether it is more likely that the
candidate pathway proceeds in a particular direction, given the evidence provided by the
interactions, QTL overlaps, and physical QTL/probe location. The end result of this
process, after starting with expression data, genotype data, marker data, and clinical trait
data, is a set of pathway groups consisting of genes that are supported as being part of the
same or related pathway, and causal information that indicates the exact relationship of
genes in the pathway (or of a partial set of genes in the pathway).



5.2. SOURCES OF MARKER DATA

Several forms of genetic markers that are used to construct marker map 78 are
known in the art. A common genetic marker is single nucleotide polymorphisms (SNPs).
SNPs occur approximately once every 600 base pairs in the genome. See, for example,
Kruglyak and Nickerson, 2001, Nature Genetics 27, 235. The present invention
contemplates the use of genotypic databases such as SNP databases as a source of genetic
markers. Alleles making up blocks of such SNPs in close physical proximity are often
correlated, resulting in reduced genetic variability and defining a limited number of
"SNP haplotypes", each of which reflects descent from a single ancient ancestral
chromosome. See Fullerton et al., 2000, Am. J. Hum. Genet. 67, 881. Such haplotype
structure is useful in selecting appropriate genetic variants for analysis. Patil et al. found
that a very dense set of SNPs is required to capture all the common haplotype
information. Once common haplotype information is available, it can be used to identify
much smaller subsets of SNPs useful for comprehensive whole-genome studies. See Patil
et al., 2001, Science 294, 1719-1723.

Other suitable sources of genetic markers include databases that have various
types of gene expression data from platform types such as spotted microarray
(microarray), high-density oligonucleotide array (HDA), hybridization filter (filter) and
serial analysis of gene expression (SAGE) data. Another example of a genetic database
that can be used is a DNA methylation database. For details on a representative DNA
methylation database, see Grunau et al., in press, MethDB - a public database for DNA
methylation data, Nucleic Acids Research; or the URL:
http://genome.imb-jena.de/public.html.

In one embodiment of the present invention, a set of genetic markers is derived
from any type of genetic database that tracks variations in the genome of an organism of
interest. Information that is typically represented in such databases is a collection of
loci within the genome of the organism of interest. For each locus, strains for which
genetic variation information is available are represented. For each represented strain,
variation information is provided. Variation information is any type of genetic variation
information. Representative genetic variation information includes, but is not limited to,
single nucleotide polymorphisms, restriction fragment length polymorphisms,
microsatellite markers, and short tandem repeats. Therefore, suitable genotypic
databases include, but are not limited to:

Genetic variation type              Uniform resource location

SNP                                 http://bioinfo.paLroche.com/us\ikaJbioinfonnatics/cgi-bi
SNP
SNP
SNP
SNP
Microsatellite markers
Restriction fragment
length polymorphisms
Short tandem repeats
Sequence length
polymorphisms
DNA methylation                     http://genome.imb-jena.de/public.html
database
Short tandem-repeat                 Broman et al., 1998, Comprehensive human genetic
polymorphisms                       maps: Individual and sex-specific variation in
                                    recombination, American Journal of Human Genetics
                                    63, 861-869
Microsatellite markers              Kong et al., 2002, A high-resolution recombination map
                                    of the human genome, Nat. Genet. 31, 241-247

In addition, the genetic variations used by the methods of the present invention
may involve differences in the expression levels of genes rather than actual identified
variations in the composition of the genome of the organism of interest. Therefore,
genotypic databases within the scope of the present invention include a wide array of
expression profile databases such as the one found at the URL:
http://www.ncbi.nlm.nih.gov/geo/.

Another form of genetic marker that may be used to construct marker map 78 is
restriction fragment length polymorphisms (RFLPs). RFLPs are the product of allelic
differences between DNA restriction fragments caused by nucleotide sequence
variability. As is well known to those of skill in the art, RFLPs are typically detected by
extraction of genomic DNA and digestion with a restriction endonuclease. Generally, the
resulting fragments are separated according to size and hybridized with a probe; single
copy probes are preferred. As a result, restriction fragments from homologous
chromosomes are revealed. Differences in fragment size among alleles represent an
RFLP (see, for example, Helentjaris et al., 1985, Plant Mol. Bio. 5:109-118, and U.S. Pat.
No. 5,324,631). Another form of genetic marker that may be used to construct marker
map 78 is random amplified polymorphic DNA (RAPD). The phrase "random amplified
polymorphic DNA" or "RAPD" refers to the amplification product of the distance
between DNA sequences homologous to a single oligonucleotide primer appearing on
different sites on opposite strands of DNA. Mutations or rearrangements at or between
binding sites will result in polymorphisms as detected by the presence or absence of
amplification product (see, for example, Welsh and McClelland, 1990, Nucleic Acids
Res. 18:7213-7218; Hu and Quiros, 1991, Plant Cell Rep. 10:505-511). Yet another
form of genetic marker that may be used to construct marker map 78 is amplified
fragment length polymorphisms (AFLP). AFLP technology refers to a process that is
designed to generate large numbers of randomly distributed molecular markers (see, for
example, European Patent Application No. 0534858 A1). Still another form of genetic
marker that may be used to construct marker map 78 is "simple sequence repeats" or
"SSRs". SSRs are di-, tri- or tetra-nucleotide tandem repeats within a genome. The
repeat region may vary in length between genotypes while the DNA flanking the repeat is
conserved such that the same primers will work in a plurality of genotypes. A
polymorphism between two genotypes represents repeats of different lengths between the
two flanking conserved DNA sequences (see, for example, Akagi et al., 1996, Theor.
Appl. Genet. 93, 1071-1077; Bligh et al., 1995, Euphytica 86:83-85; Struss et al., 1998,
Theor. Appl. Genet. 97, 308-315; Wu et al., 1993, Mol. Gen. Genet. 241, 225-235; and
U.S. Pat. No. 5,075,217). SSRs are also known as satellites or microsatellites.

As described above, many genetic markers suitable for use with the present
invention are publicly available. Those skilled in the art can also readily prepare suitable
markers. For molecular marker methods, see generally, The DNA Revolution by Andrew
H. Paterson 1996 (Chapter 2) in: Genome Mapping in Plants (ed. Andrew H. Paterson) by
Academic Press/R. G. Landis Company, Austin, Tex., 7-21.

5.3. EXEMPLARY NORMALIZATION ROUTINES

A number of different normalization protocols may be used by normalization
module 72 to normalize gene expression data 44. Some such normalization protocols are
described in this section. Typically, the normalization comprises normalizing the
expression level measurement of each gene in a plurality of genes that is expressed by an
organism in a population of interest. Many of the normalization protocols described in
this section are used to normalize microarray data. It will be appreciated that there are
many other suitable normalization protocols that may be used in accordance with the
present invention. All such protocols are within the scope of the present invention. Many
of the normalization protocols found in this section are found in publicly available
software, such as Microarray Explorer (Image Processing Section, Laboratory of
Experimental and Computational Biology, National Cancer Institute, Frederick, MD
21702, USA).

One normalization protocol is Z-score of intensity. In this protocol, raw
expression intensities are normalized by the (mean intensity)/(standard deviation) of raw
intensities for all spots in a sample. For microarray data, the Z-score of intensity method
normalizes each hybridized sample by the mean and standard deviation of the raw
intensities for all of the spots in that sample. The mean intensity $\mathrm{mnI}_i$ and the standard
deviation $\mathrm{sdI}_i$ are computed for the raw intensity of control genes. It is useful for
standardizing the mean (to 0.0) and the range of data between hybridized samples to
about -3.0 to +3.0. When using the Z-score, the Z differences ($\mathrm{Zdiff}_j$) are computed rather
than ratios. The Z-score intensity ($\text{Z-score}_{ij}$) for intensity $I_{ij}$ for probe i (hybridization
probe, protein, or other binding entity) and spot j is computed as:

$$\text{Z-score}_{ij} = \frac{I_{ij} - \mathrm{mnI}_i}{\mathrm{sdI}_i},$$

and

$$\mathrm{Zdiff}_j(x,y) = \text{Z-score}_{xj} - \text{Z-score}_{yj},$$

where x represents the x channel and y represents the y channel.
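
As a sketch only, the Z-score of intensity protocol might be coded as below. The
function names, the use of list indices to designate control genes, and the choice of the
sample (n-1) standard deviation are assumptions made for illustration; the text above does
not specify which standard deviation estimator is used.

```python
from statistics import mean, stdev


def z_scores(intensities, control_indices):
    """Z-score of intensity for every spot in one hybridized sample.

    intensities:     raw spot intensities for the sample.
    control_indices: positions of the control-gene spots used to compute
                     the mean and standard deviation (hypothetical layout).
    The sample (n-1) standard deviation is an assumption of this sketch.
    """
    controls = [intensities[k] for k in control_indices]
    mn = mean(controls)
    sd = stdev(controls)
    return [(x - mn) / sd for x in intensities]


def z_diff(z_x, z_y):
    """Spot-by-spot Z difference between the x channel and the y channel."""
    return [a - b for a, b in zip(z_x, z_y)]


# Made-up two-channel data in which every spot serves as a control.
channel_x = [100.0, 200.0, 300.0, 400.0, 500.0]
channel_y = [500.0, 400.0, 300.0, 200.0, 100.0]
all_spots = range(len(channel_x))
zx = z_scores(channel_x, all_spots)
zy = z_scores(channel_y, all_spots)
differences = z_diff(zx, zy)
```

Because both channels are standardized to mean 0.0 before subtraction, the Z differences
are comparable across samples without forming ratios.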

Another normalization protocol is the median intensity normalization protocol in
which the raw intensities for all spots in each sample are normalized by the median of the
raw intensities. For microarray data, the median intensity normalization method
normalizes each hybridized sample by the median of the raw intensities of control genes
($\mathrm{medianI}_i$) for all of the spots in that sample. Thus, upon normalization by the median
intensity normalization method, the raw intensity $I_{ij}$ for probe i and spot j has the value
$\mathrm{Im}_{ij}$ where,

$$\mathrm{Im}_{ij} = \frac{I_{ij}}{\mathrm{medianI}_i}.$$

Another normalization protocol is the log median intensity protocol. In this
protocol, raw expression intensities are normalized by the log of the median scaled raw
intensities of representative spots for all spots in the sample. For microarray data, the log
median intensity method normalizes each hybridized sample by the log of median scaled
raw intensities of control genes ($\mathrm{medianI}_i$) for all of the spots in that sample. As used
herein, control genes are a set of genes that have reproducible accurately measured
expression values. The value 1.0 is added to the intensity value to avoid taking the
log(0.0) when intensity has zero value. Upon normalization by the log median intensity
normalization method, the raw intensity $I_{ij}$ for probe i and spot j has the value $\mathrm{Im}_{ij}$ where,

$$\mathrm{Im}_{ij} = \log\left(1.0 + \frac{I_{ij}}{\mathrm{medianI}_i}\right).$$
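
The median intensity and log median intensity protocols can be sketched together, since
they differ only in the final transformation. The function names and the use of the
natural logarithm are assumptions of this sketch; the text above does not state the base
of the logarithm.

```python
from math import log
from statistics import median


def median_normalize(intensities, control_indices):
    """Divide every raw intensity by the median control-gene intensity."""
    med = median(intensities[k] for k in control_indices)
    return [x / med for x in intensities]


def log_median_normalize(intensities, control_indices):
    """Log of (1.0 + median-scaled intensity); natural log is assumed here.

    Adding 1.0 keeps a zero intensity from producing log(0.0).
    """
    med = median(intensities[k] for k in control_indices)
    return [log(1.0 + x / med) for x in intensities]


# Made-up sample: spot 0 has zero intensity, spots 1-3 are control genes.
raw = [0.0, 50.0, 100.0, 200.0]
controls = [1, 2, 3]
scaled = median_normalize(raw, controls)
log_scaled = log_median_normalize(raw, controls)
```

In this made-up sample the zero-intensity spot maps to 0.0 under both protocols rather
than raising an error.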

Yet another normalization protocol is the Z-score standard deviation log of
intensity protocol. In this protocol, raw expression intensities are normalized by the mean
log intensity ($\mathrm{mnLI}_i$) and standard deviation log intensity ($\mathrm{sdLI}_i$). For microarray data, the
mean log intensity and the standard deviation log intensity are computed for the log of raw
intensity of control genes. Then, the Z-score intensity $\mathrm{ZlogS}_{ij}$ for probe i and spot j is:

$$\mathrm{ZlogS}_{ij} = \frac{\log(I_{ij}) - \mathrm{mnLI}_i}{\mathrm{sdLI}_i}.$$

Still another normalization protocol is the Z-score mean absolute deviation of log
intensity protocol. In this protocol, raw expression intensities are normalized by the
Z-score of the log intensity using the equation (log(intensity)-mean logarithm) / standard
deviation logarithm. For microarray data, the Z-score mean absolute deviation of log
intensity protocol normalizes each bound sample by the mean and mean absolute
deviation of the logs of the raw intensities for all of the spots in the sample. The mean
log intensity $\mathrm{mnLI}_i$ and the mean absolute deviation log intensity $\mathrm{madLI}_i$ are computed
for the log of raw intensity of control genes. Then, the Z-score intensity $\mathrm{ZlogA}_{ij}$ for
probe i and spot j is:

$$\mathrm{ZlogA}_{ij} = \frac{\log(I_{ij}) - \mathrm{mnLI}_i}{\mathrm{madLI}_i}.$$
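
The two log-intensity Z-score protocols can be illustrated side by side. The sketch below
assumes natural logarithms, strictly positive intensities, and the sample (n-1) standard
deviation; the function names are hypothetical.

```python
from math import e, log
from statistics import mean, stdev


def zlog_sd(intensities, control_indices):
    """Z-score of log intensity scaled by the standard deviation of the logs."""
    logs = [log(x) for x in intensities]          # requires intensities > 0
    ctrl = [logs[k] for k in control_indices]
    mn, sd = mean(ctrl), stdev(ctrl)              # sample (n-1) SD is assumed
    return [(v - mn) / sd for v in logs]


def zlog_mad(intensities, control_indices):
    """Z-score of log intensity scaled by the mean absolute deviation of the logs."""
    logs = [log(x) for x in intensities]
    ctrl = [logs[k] for k in control_indices]
    mn = mean(ctrl)
    mad = mean([abs(v - mn) for v in ctrl])
    return [(v - mn) / mad for v in logs]


# Made-up intensities whose natural logs are 0, 1 and 2; all spots are controls.
raw_intensities = [1.0, e, e ** 2]
by_sd = zlog_sd(raw_intensities, [0, 1, 2])
by_mad = zlog_mad(raw_intensities, [0, 1, 2])
```

The mean absolute deviation is less sensitive to a few extreme spots than the standard
deviation, which is one reason both variants may be offered.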

Another normalization protocol is the user normalization gene set protocol. In
this protocol, raw expression intensities are normalized by the sum of the genes in a user
defined gene set in each sample. This method is useful if a subset of genes has been
determined to have relatively constant expression across a set of samples. Yet another
normalization protocol is the calibration DNA gene set protocol in which each sample is
normalized by the sum of calibration DNA genes. As used herein, calibration DNA genes
are genes that produce reproducible expression values that are accurately measured. Such
genes tend to have the same expression values on each of several different microarrays.
The algorithm is the same as the user normalization gene set protocol described above, but
the set is predefined as the genes flagged as calibration DNA.
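
A short sketch of gene set normalization follows. It assumes that "normalized by the sum"
means division by the summed intensity of the gene set; the function name and the gene
labels are hypothetical. The same function would serve the calibration DNA protocol by
passing the flagged calibration genes as the set.

```python
def gene_set_normalize(sample, gene_set):
    """Divide each intensity in a sample by the summed intensity of a gene set.

    sample:   dict mapping gene name to raw intensity for one sample.
    gene_set: names of the user-defined (or calibration DNA) genes.
    Division by the sum is this sketch's reading of "normalized by the sum".
    """
    total = sum(sample[gene] for gene in gene_set)
    return {gene: value / total for gene, value in sample.items()}


# Made-up sample with two reference genes of relatively constant expression.
example_sample = {"ref1": 50.0, "ref2": 150.0, "target": 400.0}
normalized_sample = gene_set_normalize(example_sample, ["ref1", "ref2"])
```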

Yet another normalization protocol is the ratio median intensity correction
protocol. This protocol is useful in embodiments in which a two-color fluorescence
labeling and detection scheme is used (see Section 5.8.1.5.). In the case where the two
fluors in a two-color fluorescence labeling and detection scheme are Cy3 and Cy5,
measurements are normalized by multiplying the ratio (Cy3/Cy5) by
medianCy5/medianCy3 intensities. If background correction is enabled, measurements
are normalized by multiplying the ratio (Cy3/Cy5) by (medianCy5-medianBkgdCy5) /
(medianCy3-medianBkgdCy3) where medianBkgd means median background levels.
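
The ratio median intensity correction can be sketched as below, with and without
background correction. The function name and argument layout are hypothetical; the scale
factor follows the two expressions given above.

```python
from statistics import median


def corrected_ratios(cy3, cy5, bkgd_cy3=None, bkgd_cy5=None):
    """Scale each Cy3/Cy5 ratio by medianCy5/medianCy3.

    cy3, cy5:           per-spot intensities in the two channels.
    bkgd_cy3, bkgd_cy5: optional per-spot background levels; when both are
                        given, the median background is subtracted from the
                        median intensity of each channel before scaling.
    """
    med3, med5 = median(cy3), median(cy5)
    if bkgd_cy3 is not None and bkgd_cy5 is not None:
        med3 -= median(bkgd_cy3)
        med5 -= median(bkgd_cy5)
    scale = med5 / med3
    return [(a / b) * scale for a, b in zip(cy3, cy5)]


# Made-up data in which the Cy5 channel is uniformly twice as bright as Cy3.
cy3_spots = [100.0, 200.0, 300.0]
cy5_spots = [200.0, 400.0, 600.0]
plain = corrected_ratios(cy3_spots, cy5_spots)
with_background = corrected_ratios(
    cy3_spots, cy5_spots, bkgd_cy3=[50.0] * 3, bkgd_cy5=[40.0] * 3
)
```

In the uncorrected case the uniform two-fold channel bias is removed, so every ratio
becomes 1.0.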




In some embodiments, intensity background correction is used to normalize
measurements. The background intensity data from a spot quantification program may
be used to correct spot intensity. Background may be specified as either a global value or
on a per-spot basis. If the array images have low background, then intensity background
correction may not be necessary.

5.4. LOGARITHM OF THE ODDS SCORES 

Denoting the joint probability of inheriting all genotypes $P(g)$, and the joint
probability of all observed data x (trait and marker species) conditional on genotypes
$P(x \mid g)$, the likelihood L for a set of data is

$$L = \sum_{g} P(g)\,P(x \mid g),$$

where the summation is over all the possible joint genotypes g (trait and marker) for all
pedigree members. What is unknown in this likelihood is the recombination fraction $\theta$,
on which $P(g)$ depends.

The recombination fraction is the probability that two loci will recombine during
meioses. The recombination fraction $\theta$ is correlated with the distance between two loci.
By definition, the genetic distance is defined to be infinity between the loci on different
chromosomes (nonsyntenic loci), and for such unlinked loci, $\theta = 0.5$. For linked loci on
the same chromosome (syntenic loci), $\theta < 0.5$, and the genetic distance is a monotonic
function of $\theta$. See, e.g., Ott, 1985, Analysis of Human Genetic Linkage, first edition,
Baltimore, MD, Johns Hopkins University Press. The essence of linkage analysis
described in Section 5.13, is to estimate the recombination fraction $\theta$ and to test whether
$\theta = 0.5$. When the position of one locus in the genome is known, genetic linkage can be
exploited to obtain an estimate of the chromosomal position of a second locus relative to
the first locus. In linkage analysis described in Section 5.2, linkage analysis is used to
map the unknown location of genes predisposing to various quantitative phenotypes
relative to a large number of marker loci in a genetic map. In the ideal situation, where
recombinant and nonrecombinant meioses can be counted unambiguously, $\theta$ is estimated
by the frequency of recombinant meioses in a large sample of meioses. If two loci are
linked, then the number of nonrecombinant meioses N is expected to be larger than the
number of recombinant meioses R. The recombination fraction between the new locus
and each marker can be estimated as:

$$\hat{\theta} = \frac{R}{N+R}$$

The likelihood of interest is:

$$L = \sum_{g} P(g \mid \theta)\,P(x \mid g)$$

and inferences about a test recombination fraction $\theta$ are based on the likelihood ratio
$\Lambda = L(\theta)/L(1/2)$ or, equivalently, its logarithm.

Thus, in a typical clinical genetics study, the likelihood of the trait and a single
marker is computed over one or more relevant pedigrees. This likelihood function is
a function of the recombination fraction $\theta$ between the trait (e.g., classical trait or
quantitative trait) and the marker locus. The standardized loglikelihood
$Z(\theta) = \log_{10}[L(\theta)/L(1/2)]$ is referred to as a lod score. Here, "lod" is an abbreviation for
"logarithm of the odds." A lod score permits visualization of linkage evidence. As a rule
of thumb, in human studies, geneticists provisionally accept linkage if

$$Z(\hat{\theta}) > 3$$

at its maximum $\hat{\theta}$ on the interval [0,1/2], where $\hat{\theta}$ represents the maximum on the
interval. Further, linkage is provisionally rejected at a particular $\theta$ if

$$Z(\theta) < -2.$$

However, for complex traits, other rules have been suggested. See, for example, Lander
and Kruglyak, 1995, Nature Genetics 11, p. 241.
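
For the ideal situation in which recombinant and nonrecombinant meioses can be counted
unambiguously, the lod score can be worked through with a short sketch. It assumes the
binomial likelihood $L(\theta) = \theta^{R}(1-\theta)^{N}$, which is a standard simplification and not the
pedigree likelihood given above; the function names and the counts are hypothetical.

```python
from math import log10


def theta_hat(recombinants, nonrecombinants):
    """Estimate the recombination fraction as R / (N + R), capped at 1/2."""
    return min(recombinants / (recombinants + nonrecombinants), 0.5)


def lod(theta, recombinants, nonrecombinants):
    """Lod score Z(theta) = log10[L(theta) / L(1/2)].

    Assumes the binomial likelihood L(theta) = theta**R * (1 - theta)**N,
    valid for counted meioses; a zero count contributes nothing, so
    theta = 0 is allowed only when there are no recombinants.
    """
    def term(count, probability):
        return 0.0 if count == 0 else count * log10(probability)

    total = recombinants + nonrecombinants
    return (term(recombinants, theta)
            + term(nonrecombinants, 1.0 - theta)
            - total * log10(0.5))


# Made-up counts: 2 recombinant and 18 nonrecombinant meioses.
R, N = 2, 18
estimate = theta_hat(R, N)
z_max = lod(estimate, R, N)
linkage_accepted = z_max > 3        # rule of thumb for provisional acceptance
```

With these made-up counts the estimate is 0.1 and the lod score is a little under 3.2, so
linkage would be provisionally accepted under the rule of thumb above.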

Acceptance and rejection are treated asymmetrically because, with 22 pairs of
human autosomes, it is unlikely that a random marker even falls on the same chromosome
as a trait locus. See Lange, 1997, Mathematical and Statistical Methods for Genetic
Analysis, Springer-Verlag, New York; Olson, 1999, Tutorial in Biostatistics: Genetic
Mapping of Complex Traits, Statistics in Medicine 18, 2961-2981.

When the value of L is large, the null hypothesis of no linkage, L(1/2), to a
marker locus of known location can be rejected, and the relative location of the locus
corresponding to the quantitative trait can be estimated by θ̂. Therefore, lod scores
provide a method to calculate linkage distances as well as to estimate the probability that
two genes (and/or QTLs) are linked.
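
The lod score rule of thumb can be illustrated with a minimal sketch. It assumes the simplest phase-known case, in which the R recombinant and N nonrecombinant meioses are counted directly, so that the likelihood is binomial, L(θ) = θ^R (1 − θ)^N. This binomial form and the function names are illustrative assumptions; pedigree likelihoods require the more general methods cited above.

```python
import math


def recombination_fraction(n_recomb: int, n_nonrecomb: int) -> float:
    """Estimate theta as R / (N + R), capped at 1/2 (the no-linkage value)."""
    total = n_recomb + n_nonrecomb
    if total == 0:
        raise ValueError("at least one meiosis is required")
    return min(n_recomb / total, 0.5)


def lod_score(theta: float, n_recomb: int, n_nonrecomb: int) -> float:
    """Z(theta) = log10[L(theta) / L(1/2)], assuming L(theta) = theta^R (1 - theta)^N."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 1/2]")

    def log10_likelihood(t: float) -> float:
        value = 0.0
        if n_recomb:
            if t == 0.0:
                return -math.inf  # recombinants cannot occur when theta = 0
            value += n_recomb * math.log10(t)
        if n_nonrecomb:
            value += n_nonrecomb * math.log10(1.0 - t)
        return value

    return log10_likelihood(theta) - log10_likelihood(0.5)


# Hypothetical counts: 2 recombinant and 18 nonrecombinant meioses.
theta_hat = recombination_fraction(n_recomb=2, n_nonrecomb=18)
z_max = lod_score(theta_hat, 2, 18)
print(f"theta_hat = {theta_hat:.2f}, Z(theta_hat) = {z_max:.2f}")
```

With these hypothetical counts the maximum lod score exceeds 3, so linkage would be provisionally accepted under the rule of thumb stated above.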


Those of skill in the art will appreciate that lod score computation is species
dependent. For example, methods for computing the lod score in mouse differ from
that described in this section. However, methods for computing lod scores are known in
the art and the method described in this section is only by way of illustration and not by
limitation.

5.5. CLUSTERING TECHNIQUES 

The subsections below describe exemplary methods for clustering. Such
techniques can be used to cluster QTL vectors in order to form QTL interaction maps.
The same techniques can be applied to gene expression vectors in order to form gene
expression cluster maps. Further, these techniques can be used to perform unsupervised
or supervised classification in accordance with processing step 106 and/or step 108 (Fig.
2). In these techniques, QTL vectors, gene expression vectors, or sets of cellular
constituent measurements from different organisms in a population are clustered based on
the strength of interaction between the data (e.g., QTL vectors, gene expression vectors,
or sets of cellular constituents). More information on clustering techniques can be found
in Kaufman and Rousseeuw, 1990, Finding Groups in Data: An Introduction to Cluster
Analysis, Wiley, New York, NY; Everitt, 1993, Cluster Analysis (3d ed.), Wiley, New
York, NY; Backer, 1995, Computer-Assisted Reasoning in Cluster Analysis, Prentice
Hall, Upper Saddle River, New Jersey; and Duda et al., 2001, Pattern Classification,
John Wiley & Sons, New York, NY.

5.5.1. HIERARCHICAL CLUSTERING TECHNIQUES

Hierarchical cluster analysis is a statistical method for finding relatively
homogeneous clusters of elements based on measured characteristics. Consider a sequence
of partitions of n samples into c clusters. The first of these is a partition into n clusters,
each cluster containing exactly one sample. The next is a partition into n-1 clusters, the
next is a partition into n-2, and so on until the nth, in which all the samples form one
cluster. Level k in the sequence of partitions occurs when c = n - k + 1. Thus, level one
corresponds to n clusters and level n corresponds to one cluster. Given any two samples
x and x', at some level they will be grouped together in the same cluster. If the sequence
has the property that whenever two samples are in the same cluster at level k they remain
together at all higher levels, then the sequence is said to be a hierarchical clustering.
See Duda et al., 2001, Pattern Classification, John Wiley & Sons, New York, p. 551.



5.5.1.1. AGGLOMERATIVE CLUSTERING

In some embodiments, the hierarchical clustering technique used to cluster gene
analysis vectors is an agglomerative clustering procedure. Agglomerative (bottom-up
clustering) procedures start with n singleton clusters and form a sequence of partitions by
successively merging clusters. The major steps in agglomerative clustering are contained
in the following procedure, where c is the desired number of final clusters, Di and Dj are
clusters, xi is a gene analysis vector, and there are n such vectors:

begin initialize c, ĉ ← n, Di ← {xi}, i = 1, ..., n
    do ĉ ← ĉ - 1
        find nearest clusters, say, Di and Dj
        merge Di and Dj
    until c = ĉ
    return c clusters
end

In this algorithm, the terminology a ← b assigns to variable a the new value b. As
described, the procedure terminates when the specified number of clusters has been
obtained and returns the clusters as a set of points. A key point in this algorithm is how to
measure the distance between two clusters Di and Dj. The method used to define the
distance between clusters Di and Dj defines the type of agglomerative clustering
technique used. Representative techniques include the nearest-neighbor algorithm,
farthest-neighbor algorithm, the average linkage algorithm, the centroid algorithm, and
the sum-of-squares algorithm.
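
The agglomerative procedure can be sketched in a few lines of Python. This is a minimal illustration rather than an optimized implementation: the function names are hypothetical, Euclidean distance between vectors is assumed, and the cluster distance is passed in as a parameter so that the nearest-neighbor (minimum) or farthest-neighbor (maximum) rule can be swapped in.

```python
import math
from itertools import combinations


def d_min(cluster_a, cluster_b):
    """Nearest-neighbor distance: smallest pairwise distance between the clusters."""
    return min(math.dist(x, y) for x in cluster_a for y in cluster_b)


def d_max(cluster_a, cluster_b):
    """Farthest-neighbor distance: largest pairwise distance between the clusters."""
    return max(math.dist(x, y) for x in cluster_a for y in cluster_b)


def agglomerate(vectors, c, cluster_distance=d_min):
    """Merge the two nearest clusters until only c clusters remain."""
    if not 1 <= c <= len(vectors):
        raise ValueError("c must be between 1 and the number of vectors")
    clusters = [[tuple(v)] for v in vectors]  # start with n singleton clusters
    while len(clusters) > c:
        # Find the nearest pair of clusters (i < j) under the chosen distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge Dj into Di
        del clusters[j]
    return clusters


# Hypothetical two-dimensional vectors.
data = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerate(data, c=3))
```

Each pass recomputes all pairwise cluster distances, which is acceptable for a small illustration but becomes costly for large sets of vectors.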


Nearest-neighbor algorithm. The nearest-neighbor algorithm uses the following 
equation to measure the distances between clusters: 



$$d_{\min}(D_i, D_j) = \min_{x \in D_i,\; x' \in D_j} \lVert x - x' \rVert$$

This algorithm is also known as the minimum algorithm. Furthermore, if the
algorithm is terminated when the distance between nearest clusters exceeds an arbitrary
threshold, it is called the single-linkage algorithm. Consider the case in which the data
points are nodes of a graph, with edges forming a path between the nodes in the same
subset Di. When dmin() is used to measure the distance between subsets, the nearest
neighbor nodes determine the nearest subsets. The merging of Di and Dj corresponds to
adding an edge between the nearest pair of nodes in Di and Dj. Because edges linking
clusters always go between distinct clusters, the resulting graph never has any closed
loops or circuits; in the terminology of graph theory, this procedure generates a tree. If it
is allowed to continue until all of the subsets are linked, the result is a spanning tree. A
spanning tree is a tree with a path from any node to any other node. Moreover, it can be
shown that the sum of the edge lengths of the resulting tree will not exceed the sum of the
edge lengths for any other spanning tree for that set of samples. Thus, with the use of
dmin() as the distance measure, the agglomerative clustering procedure becomes an
algorithm for generating a minimal spanning tree. See Duda et al., id., pp. 553-554.
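
The connection to minimal spanning trees can be made concrete with a short sketch. Repeatedly joining the two nearest clusters under dmin() adds edges in the same order as Kruskal's algorithm, which is the technique used below. The function name and the union-find bookkeeping are illustrative choices, and Euclidean distance is assumed.

```python
import math
from itertools import combinations


def single_linkage_tree(points):
    """Return the edges added by nearest-neighbor merging until one cluster remains."""
    parent = list(range(len(points)))  # union-find: each point starts as its own cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Candidate edges ordered by length, shortest first.
    edges = sorted(
        combinations(range(len(points)), 2),
        key=lambda e: math.dist(points[e[0]], points[e[1]]),
    )
    tree = []
    for i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two distinct clusters, so no loop forms
            parent[root_i] = root_j  # merge the two clusters
            tree.append((i, j))
    return tree


# Hypothetical one-dimensional samples.
samples = [(0,), (1,), (3,), (7,)]
tree = single_linkage_tree(samples)
total = sum(math.dist(samples[i], samples[j]) for i, j in tree)
print(tree, total)
```

For n samples the returned tree always has n - 1 edges, one per merge.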

Farthest-neighbor algorithm. The farthest-neighbor algorithm uses the following
equation to measure the distances between clusters:

$$d_{\max}(D_i, D_j) = \max_{x \in D_i,\; x' \in D_j} \lVert x - x' \rVert$$

This algorithm is also known as the maximum algorithm. If the clustering is terminated
when the distance between the nearest clusters exceeds an arbitrary threshold, it is called
the complete-linkage algorithm. The farthest-neighbor algorithm discourages the growth
of elongated clusters. Application of this procedure can be thought of as producing a
graph in which the edges connect all of the nodes in a cluster. In the terminology of
graph theory, every cluster contains a complete subgraph. The distance between two
clusters is determined by the most distant nodes in the two clusters. When the nearest
clusters are merged, the graph is changed by adding edges between every pair of nodes in
the two clusters.

Average linkage algorithm. Another agglomerative clustering technique is the
average linkage algorithm. The average linkage algorithm uses the following equation to
measure the distances between clusters:

$$d_{\text{avg}}(D_i, D_j) = \frac{1}{n_i n_j} \sum_{x \in D_i} \sum_{x' \in D_j} \lVert x - x' \rVert$$

where n_i and n_j are the numbers of samples in Di and Dj.

Hierarchical cluster analysis begins by making a pair-wise comparison of all gene
analysis vectors in a set of such vectors. After evaluating similarities from all pairs of
elements in the set, a distance matrix is constructed. In the distance matrix, a pair of
vectors with the shortest distance (i.e., most similar values) is selected. Then, when the
average linkage algorithm is used, a "node" ("cluster") is constructed by averaging the
two vectors. The similarity matrix is updated with the new "node" ("cluster") replacing
the two joined elements, and the process is repeated n-1 times until only a single element
remains. Consider six elements, A-F, having the values:

A{4.9}, B{8.2}, C{3.0}, D{5.2}, E{8.3}, F{2.3}.

In the first partition, using the average linkage algorithm, one matrix (sol. 1) that could be
computed is:

(sol. 1) A{4.9}, B-E{8.25}, C{3.0}, D{5.2}, F{2.3}.

Alternatively, the first partition using the average linkage algorithm could yield the
matrix:

(sol. 2) A{4.9}, C{3.0}, D{5.2}, E-B{8.25}, F{2.3}.

Assuming that solution 1 was identified in the first partition, the second partition
using the average linkage algorithm will yield:

(sol. 1-1) A-D{5.05}, B-E{8.25}, C{3.0}, F{2.3}

or

(sol. 1-2) B-E{8.25}, C{3.0}, D-A{5.05}, F{2.3}.

Assuming that solution 2 was identified in the first partition, the second partition
of the average linkage algorithm will yield:

(sol. 2-1) A-D{5.05}, C{3.0}, E-B{8.25}, F{2.3}

or

(sol. 2-2) C{3.0}, D-A{5.05}, E-B{8.25}, F{2.3}.

Thus, after just two partitions in the average linkage algorithm, there are already four
matrices. See Duda et al., Pattern Classification, John Wiley & Sons, New York, 2001, p.
551.
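
The six-element example can be reproduced with a short sketch. It follows the description above literally: the closest pair of nodes is selected and replaced by a node holding the unweighted average of the two. The function name and the dictionary representation are illustrative assumptions, and only one of the equivalent orderings (B-E rather than E-B) is produced.

```python
def merge_closest(nodes):
    """One partition step: join the two closest nodes and replace them by their average."""
    names = sorted(nodes)
    first, second = min(
        ((p, q) for i, p in enumerate(names) for q in names[i + 1:]),
        key=lambda pair: abs(nodes[pair[0]] - nodes[pair[1]]),
    )
    merged = {name: value for name, value in nodes.items() if name not in (first, second)}
    merged[f"{first}-{second}"] = (nodes[first] + nodes[second]) / 2
    return merged


elements = {"A": 4.9, "B": 8.2, "C": 3.0, "D": 5.2, "E": 8.3, "F": 2.3}
first_partition = merge_closest(elements)          # joins B and E
second_partition = merge_closest(first_partition)  # joins A and D
print(sorted(first_partition))
print(sorted(second_partition))
```

The first step joins B and E into a node near 8.25 and the second joins A and D into a node near 5.05, matching solutions 1 and 1-1 above.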

5.5.1.2. CLUSTERING WITH PEARSON CORRELATION COEFFICIENTS
In one embodiment of the present invention, QTL vectors and/or gene expression
vectors are clustered using agglomerative hierarchical clustering with Pearson correlation
coefficients. In this form of clustering, similarity is determined using Pearson correlation
coefficients between the QTL vector pairs, gene expression pairs, or sets of cellular
constituent measurements. Other metrics that can be used, in addition to the Pearson
correlation coefficient, include but are not limited to, a Euclidean distance, a squared
Euclidean distance, a Euclidean sum of squares, a Manhattan metric, and a squared
Pearson correlation coefficient. Such metrics may be computed using SAS (Statistics
Analysis Systems Institute, Cary, North Carolina) or S-Plus (Statistical Sciences, Inc.,
Seattle, Washington).
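
As a minimal sketch, several of the metrics named above can be written directly in Python. The function names are illustrative, and the use of 1 - r as a dissimilarity for clustering is an assumption; the text does not specify how a correlation is converted into a distance.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    covariance = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    spread_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    spread_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if spread_u == 0 or spread_v == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return covariance / (spread_u * spread_v)


def correlation_distance(u, v):
    """Assumed dissimilarity: 0 for perfectly correlated vectors, 2 for anticorrelated."""
    return 1.0 - pearson(u, v)


def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def euclidean(u, v):
    return math.sqrt(squared_euclidean(u, v))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


# Hypothetical expression vectors for three genes across four conditions.
gene_1 = [1.0, 2.0, 3.0, 4.0]
gene_2 = [2.0, 4.0, 6.0, 8.0]
gene_3 = [4.0, 3.0, 2.0, 1.0]
print(pearson(gene_1, gene_2), pearson(gene_1, gene_3))
```

Note that gene_1 and gene_2 are perfectly correlated even though their Euclidean distance is large, which is why the choice of metric affects which vectors cluster together.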

5.5.1.3. DIVISIVE CLUSTERING
In some embodiments, the hierarchical clustering technique used to cluster QTL
vectors and/or gene expression vectors is a divisive clustering procedure. Divisive
(top-down clustering) procedures start with all of the samples in one cluster and form the
sequence by successively splitting clusters. Divisive clustering techniques are classified
as either a polythetic or a monothetic method. A polythetic approach divides clusters into
arbitrary subsets.

5.5.2. K-MEANS CLUSTERING
In k-means clustering, sets of QTL vectors, gene expression vectors, or sets of
cellular constituent measurements are randomly assigned to K user-specified clusters.
The centroid of each cluster is computed by averaging the value of the vectors in each
cluster. Then, for each i = 1, ..., N, the distance between vector xi and each of the cluster
centroids is computed. Each vector xi is then reassigned to the cluster with the closest
centroid. Next, the centroid of each affected cluster is recalculated. The process iterates
until no more reassignments are made. See Duda et al., 2001, Pattern Classification,
John Wiley & Sons, New York, NY, pp. 526-528. A related approach is the fuzzy k-means
clustering algorithm, which is also known as the fuzzy c-means algorithm. In the
fuzzy k-means clustering algorithm, the assumption that every QTL vector, gene
expression vector, or set of cellular constituent measurements is in exactly one cluster at
any given time is relaxed so that every vector (or set) has some graded or "fuzzy"
membership in a cluster. See Duda et al., 2001, Pattern Classification, John Wiley &
Sons, New York, NY, pp. 528-530.
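
The k-means iteration described above can be sketched as follows. This is a minimal illustration under stated assumptions: Euclidean distance is used, the random initial assignment is arranged so that no cluster starts empty, and a cluster that later becomes empty simply keeps its previous centroid. The function names and the max_iter safeguard are hypothetical.

```python
import math
import random


def centroid(members):
    """Average the vectors of a cluster coordinate by coordinate."""
    return tuple(sum(column) / len(members) for column in zip(*members))


def k_means(vectors, k, seed=0, max_iter=100):
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of vectors")
    rng = random.Random(seed)
    order = list(range(len(vectors)))
    rng.shuffle(order)
    labels = [0] * len(vectors)
    for rank, index in enumerate(order):
        labels[index] = rank % k  # random assignment that leaves no cluster empty
    centroids = [None] * k
    for _ in range(max_iter):
        for cluster in range(k):
            members = [v for v, label in zip(vectors, labels) if label == cluster]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[cluster] = centroid(members)
        # Reassign every vector to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c])) for v in vectors
        ]
        if new_labels == labels:  # no more reassignments: converged
            break
        labels = new_labels
    return labels, centroids


# Hypothetical vectors forming two well-separated groups.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(data, k=2)
print(labels, centroids)
```

Which group receives label 0 depends on the random initial assignment, so only the grouping itself, not the label values, is meaningful.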

5.5.3. JARVIS-PATRICK CLUSTERING

Jarvis-Patrick clustering is a nearest-neighbor non-hierarchical clustering method
in which a set of objects is partitioned into clusters on the basis of the number of shared
nearest-neighbors. In the standard implementation advocated by Jarvis and Patrick, 1973,
IEEE Trans. Comput., C-22:1025-1034, a preprocessing stage identifies the K
nearest-neighbors of each object in the dataset. In the subsequent clustering stage, two
objects i and j join the same cluster if (i) i is one of the K nearest-neighbors of j, (ii) j is
one of the K nearest-neighbors of i, and (iii) i and j have at least kmin of their K
nearest-neighbors in common, where K and kmin are user-defined parameters. The
method has been widely applied to clustering chemical structures on the basis of fragment
descriptors and has the advantage of being much less computationally demanding than
hierarchical methods, and thus more suitable for large databases. Jarvis-Patrick clustering
may be performed using the Jarvis-Patrick Clustering Package 3.0 (Barnard Chemical
Information, Ltd., Sheffield, United Kingdom).
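
The three joining conditions can be sketched directly. This is an illustrative implementation, not the commercial package named above: Euclidean distance is assumed, an object is not counted among its own nearest-neighbors, and clusters are formed as connected groups of objects that satisfy the conditions pairwise. The function name and these choices are assumptions.

```python
import math


def jarvis_patrick(points, k, k_min):
    """Return one cluster label per point using shared nearest-neighbors."""
    n = len(points)
    # Preprocessing stage: the K nearest-neighbors of each object (itself excluded).
    neighbors = []
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        neighbors.append(set(others[:k]))

    parent = list(range(n))  # union-find over objects

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Clustering stage: join i and j when all three conditions hold.
    for i in range(n):
        for j in range(i + 1, n):
            mutual = i in neighbors[j] and j in neighbors[i]
            shared = len(neighbors[i] & neighbors[j])
            if mutual and shared >= k_min:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


# Hypothetical one-dimensional objects forming two groups.
objects = [(0,), (1,), (2,), (10,), (11,), (12,)]
print(jarvis_patrick(objects, k=2, k_min=1))
```

The labels returned are arbitrary representatives; two objects belong to the same cluster exactly when their labels are equal.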

5.5.4. NEURAL NETWORKS
A neural network has a layered structure that includes a layer of input units (and
the bias) connected by a layer of weights to a layer of output units. In multilayer neural
networks, there are input units, hidden units, and output units. In fact, any continuous
function from input to output can be implemented as a three-layer network, given a
sufficient number of hidden units. In such networks, the
weights are set based on training patterns and the desired output. One method for
supervised training of multilayer neural networks is back-propagation. Back-propagation
allows for the calculation of an effective error for each hidden unit, and thus derivation of
a learning rule for the input-to-hidden weights of the neural network.

The basic approach to the use of neural networks is to start with an untrained
network, present a training pattern to the input layer, and pass signals through the net and
determine the output at the output layer. These outputs are then compared to the target
values; any difference corresponds to an error. This error or criterion function is some
scalar function of the weights and is minimized when the network outputs match the
desired outputs. Thus, the weights are adjusted to reduce this measure of error. Three
commonly used training protocols are stochastic, batch, and on-line. In stochastic
training, patterns are chosen randomly from the training set and the network weights are
updated for each pattern presentation. Multilayer nonlinear networks trained by gradient
descent methods such as stochastic back-propagation perform a maximum-likelihood
estimation of the weight values in the model defined by the network topology. In batch
training, all patterns are presented to the network before learning takes place. Typically,
in batch training, several passes are made through the training data. In on-line training,
each pattern is presented once and only once to the net.
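
Stochastic back-propagation for a three-layer network can be sketched in plain Python. The sketch assumes sigmoid units, a single output unit, and a squared-error criterion; the class name, learning rate, and number of presentations are illustrative choices, not values taken from the text.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class ThreeLayerNet:
    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each hidden unit has one weight per input plus a bias weight.
        self.w_hidden = [
            [rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)
        ]
        self.w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def forward(self, x):
        xb = list(x) + [1.0]  # inputs plus bias
        hidden = [sigmoid(sum(w * v for w, v in zip(ws, xb))) for ws in self.w_hidden]
        hb = hidden + [1.0]  # hidden activations plus bias
        out = sigmoid(sum(w * h for w, h in zip(self.w_out, hb)))
        return xb, hb, out

    def train_pattern(self, x, target, rate=0.5):
        xb, hb, out = self.forward(x)
        # Error signal at the output for the criterion 0.5 * (out - target)^2.
        delta_out = (out - target) * out * (1.0 - out)
        # Effective error for each hidden unit, propagated back through w_out.
        delta_hidden = [
            delta_out * self.w_out[j] * hb[j] * (1.0 - hb[j])
            for j in range(len(self.w_hidden))
        ]
        self.w_out = [w - rate * delta_out * h for w, h in zip(self.w_out, hb)]
        for j, ws in enumerate(self.w_hidden):
            self.w_hidden[j] = [w - rate * delta_hidden[j] * v for w, v in zip(ws, xb)]


def criterion(net, patterns):
    """Sum of squared errors over all training patterns."""
    return 0.5 * sum((net.forward(x)[2] - t) ** 2 for x, t in patterns)


def train_stochastic(net, patterns, presentations=5000, seed=1):
    rng = random.Random(seed)
    for _ in range(presentations):
        x, target = rng.choice(patterns)  # pattern chosen randomly from the training set
        net.train_pattern(x, target)      # weights updated after each presentation


# Hypothetical training set: logical OR of two inputs.
patterns = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]
net = ThreeLayerNet(n_in=2, n_hidden=2)
error_before = criterion(net, patterns)
train_stochastic(net, patterns)
error_after = criterion(net, patterns)
print(error_before, error_after)
```

A batch protocol would instead accumulate the weight changes over all patterns before applying them, and an on-line protocol would present each pattern exactly once.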

5.5.5. SELF-ORGANIZING MAPS

A self-organizing map is a neural network that is based on a divisive clustering
approach. The aim is to assign genes to a series of partitions on the basis of the similarity
of their expression vectors to reference vectors that are defined for each partition.
Consider the case in which there are two microarrays from two different experiments. It
is possible to build up a two-dimensional construct where every spot corresponds to the
expression levels of any given gene in the two experiments. A two-dimensional grid is
built, resulting in several partitions of the two-dimensional construct. Next, a gene is
randomly picked and the identity of the reference vector (node) closest to the gene picked
is determined based on a distance matrix. The reference vector is then adjusted so that it
is more similar to the vector of the assigned gene. That means the reference vector is
moved one distance unit on the x-axis and y-axis and becomes closer to the assigned
gene. The other nodes are all adjusted to the assigned gene, but are only moved one-half
or one-fourth distance unit. This cycle is repeated hundreds of thousands of times until
the reference vectors converge to fixed values and the grid is stable. At that time,
every reference vector is the center of a group of genes. Finally, the genes are mapped to
the relevant partitions depending on the reference vector to which they are most similar.
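
The training cycle can be sketched as follows. This is a simplified illustration under stated assumptions: the winning reference vector moves a fixed fraction of the way toward the picked gene, its immediate grid neighbors move half that fraction, and more distant nodes are left unchanged. The function names, the learning rate, and the iteration count are hypothetical and much smaller than the hundreds of thousands of cycles mentioned above.

```python
import math
import random


def train_som(vectors, rows, cols, iterations=2000, rate=0.1, seed=0):
    """Return a dict mapping grid position (row, col) to its reference vector."""
    rng = random.Random(seed)
    # Initialize each reference vector as a copy of a randomly chosen gene vector.
    nodes = {(r, c): list(rng.choice(vectors)) for r in range(rows) for c in range(cols)}
    for _ in range(iterations):
        gene = rng.choice(vectors)  # pick a gene at random
        winner = min(nodes, key=lambda k: math.dist(nodes[k], gene))
        for (r, c), ref in nodes.items():
            gap = abs(r - winner[0]) + abs(c - winner[1])  # distance on the grid
            if gap > 1:
                continue
            step = rate if gap == 0 else rate / 2  # neighbors move half as far
            for d, value in enumerate(gene):
                ref[d] += step * (value - ref[d])
    return nodes


def map_genes(vectors, nodes):
    """Assign each gene to the partition whose reference vector is most similar."""
    return [min(nodes, key=lambda k: math.dist(nodes[k], v)) for v in vectors]


# Hypothetical expression levels of four genes in two experiments.
genes = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
nodes = train_som(genes, rows=1, cols=2)
print(map_genes(genes, nodes))
```

Practical implementations usually shrink both the learning rate and the neighborhood over time; that refinement is omitted here for brevity.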

5.6. MULTIVARIATE STATISTICAL MODELS 

Using the methods of the present invention, candidate pathway groups are
identified from the analysis of QTL interaction map data and gene expression cluster
maps. Each candidate pathway group includes a number of genes. The methods of the
present invention are advantageous because they filter the potentially thousands of genes
in the genome of the population of interest into a few candidate pathway groups using
clustering techniques. In a typical case, a candidate pathway group represents a group of
genes that tightly cluster in a gene expression cluster map. The genes in a candidate
pathway group may also cluster tightly in a QTL interaction map. The QTL interaction
map serves as a complementary approach to defining the genes in a candidate pathway
group. For example, consider the case in which genes A, B, and C cluster tightly in a
gene expression cluster map. Furthermore, genes A, B, C and D cluster tightly in the
corresponding QTL interaction map. In this example, analysis of the gene expression
cluster map alone suggests that genes A, B, and C form a candidate pathway group.
However, analysis of both the QTL interaction map and the gene expression cluster map
suggests that the candidate pathway group comprises genes A, B, C, and D.

Once candidate pathway groups have been identified, multivariate statistical
techniques can be used to determine whether each of the genes in the candidate pathway
group affects a particular trait, such as a complex disease trait. The form of multivariate
statistical analysis used in some embodiments of the present invention is dependent upon
the type of genotype and/or pedigree data that is available.

Typically, more pedigree data is available in cases where the population to be
studied is plants or animals. In such instances, multivariate statistical models such as
those of Jiang and Zeng, 1995, Genetics 140, pp. 1111-1127, as well as the
techniques implemented in QTL Cartographer (Basten and Zeng, 1994, Zmap-a QTL
cartographer, Proceedings of the 5th World Congress on Genetics Applied to Livestock
Production: Computing Strategies and Software 22, Smith et al. eds., pp. 65-66, The
Organizing Committee, 5th World Congress on Genetics Applied to Livestock
Production, Guelph, Ontario, Canada; Basten et al., 2001, QTL Cartographer, Version
1.15, Department of Statistics, North Carolina State University, Raleigh, North Carolina),
can be used. In addition, marker regression (joint mapping, marker-difference regression,
MDR), interval mapping with marker cofactors, and composite interval mapping can be used.
See, for example, Lynch & Walsh, 1998, Genetics and Analysis of Quantitative Traits,
Sinauer Associates, Inc., Sunderland, MA.

Jiang and Zeng have developed a multiple-trait extension to composite interval
mapping (CIM). See, for example, Jiang and Zeng, 1995, Genetics 140, p. 1111. CIM
refers to the general approach of adding marker cofactors to an otherwise standard
interval analysis (e.g., QTL detection using linear models or via maximum likelihood).
CIM handles multiple QTLs by incorporating multilocus marker information from
organisms by modifying standard interval mapping to include additional markers as
cofactors for analysis. See, for example, Jansen, 1993, Genetics 135, p. 205; Zeng, 1994,
Genetics 136, p. 1457. The multiple-trait extension to CIM developed by Jiang and Zeng
provides a framework for testing the candidate pathway groups that are constructed using
the methods of the present invention in cases where the genes in these candidate pathway
groups link to the same genetic region. The methods of Jiang and Zeng allow for the
determination as to whether expression values (for the genes in the candidate pathway
group) linking to the same region are controlled by a single gene (i.e., pleiotropy) or by two
closely linked genes. If the methods of Jiang and Zeng suggest that multiple genes are
actually controlled by closely linked loci (closely linked genes), then there is no support
for the conclusion that the genes linking to the same region are in the same pathway.
Moreover, the
components (hierarchy) of a pathway can be deduced by testing subsets of the pathway
group to see which genes have an underlying pleiotropic relationship with respect to other
genes. Further, the definition of the candidate pathway group can be refined by
eliminating specific genes in the candidate pathway group that do not have a pleiotropic
relationship with other genes in the candidate pathway group. The idea is to determine
which of the genes linking to a given region have other genes linking to their physical
location, indicating the order for hierarchy and control.

Presently, the practical limits are that no more than ten genes can be handled at
once using multivariate methods such as the Jiang and Zeng methods. Theoretically, the
number of genes is limited by the amount of data available to fit the model, but the
particular limitation is that the optimization techniques are not effective for greater than
10 dimensions. However, in some embodiments, more than 10 genes can be handled at
once by implementing dimensionality reduction techniques (like principal components).

For human genotype and pedigree data, methods described in Allison, 1998,
Multiple Phenotype Modeling in Gene-Mapping Studies of Quantitative Traits: Power
Advantages, Am. J. Hum. Genetics 63, pp. 1190-1201, are used, including, but not limited
to, those of Amos et al., 1990, Am. J. Hum. Genetics 47, pp. 247-254.

In some embodiments, gene expression data 44 is collected for multiple tissue
types. In such instances, multivariate analysis can be used to determine the true nature of
a complex disease. Multivariate techniques used in this embodiment of the invention are
described, in part, in Williams et al., 1999, Am J Hum Genet 65(4): 1134-47; Amos et al.,
1990, Am J Hum Genet 47(2): 247-54; and Jiang and Zeng, 1995, Genetics
140:1111-1127.

Asthma provides one example of a complex disease that can be studied using
expression data from multiple tissue types. Asthma is expected to, in part, be influenced
by immune system response not only in lungs but also in blood. By measuring expression
of genes in the lung and in blood, the following model could be used to dissect the shared
genetic effect in a model system, e.g., an F2 mouse cross:

$$y_{ji} = a_i + b_i x_j + d_i z_j + e_{ji}$$

where, for individual j and a putative QTL:

y_ji consists of asthma-relevant phenotypes, expression data for gene
expression in the lung and expression data for gene expression in blood;

x_j is the number of QTL alleles from a specific parental line;

z_j is 1 if the individual is heterozygous for the QTL and 0 otherwise;

a_i represents the mean for phenotype i;

b_i and d_i represent the additive and dominance effects of the QTL on phenotype i;

and

e_ji is the residual error for individual j and phenotype i.

It is typically assumed that the residuals are uncorrelated between individuals, and
the correlation between residuals within an individual is modeled as Cov(e_jk, e_jl) = ρ_kl σ_k
σ_l. Assuming a multivariate normal distribution for the residuals, likelihood analysis can
be used to test for joint linkage of a QTL to the trait vector and to test for pleiotropic
effects versus close linkage. With such information, it would be possible to detect a QTL
that influences susceptibility to asthma through causing changes in gene expression for a
set of genes expressed in blood and for a set of, potentially overlapping, genes expressed
in lung. Such multivariate analyses in accordance with the present invention, combined
with high quality phenotypic data that includes expression data across multiple tissues,
allow for improved detection of those genes truly influencing susceptibility to complex
diseases.
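
The additive and dominance terms of the model can be illustrated with a small sketch for an F2 cross. It assumes that the QTL genotype of each individual is observed directly (in practice it is inferred from flanking markers) and it ignores the residual covariance structure, so it is not the likelihood analysis of Jiang and Zeng. Because the model has three parameters per phenotype and three genotype classes, the least-squares estimates reduce to functions of the genotype-class means. The function name is hypothetical.

```python
def fit_qtl_effects(genotypes, phenotypes):
    """Estimate (a_i, b_i, d_i) for each phenotype i.

    genotypes  -- x_j in {0, 1, 2}: number of QTL alleles from one parental line
    phenotypes -- one row per individual j, one column per phenotype i
    """
    estimates = []
    for i in range(len(phenotypes[0])):
        means = {}
        for g in (0, 1, 2):
            values = [row[i] for x, row in zip(genotypes, phenotypes) if x == g]
            if not values:
                raise ValueError("each genotype class must be observed")
            means[g] = sum(values) / len(values)
        a = means[0]                               # x = 0, z = 0: y = a
        b = (means[2] - means[0]) / 2              # x = 2, z = 0: y = a + 2b
        d = means[1] - (means[0] + means[2]) / 2   # x = 1, z = 1: y = a + b + d
        estimates.append((a, b, d))
    return estimates


# Hypothetical noiseless data for two phenotypes (e.g., one lung and one blood trait).
x = [0, 0, 1, 1, 2, 2]
y = [(1.0, 5.0), (1.0, 5.0), (1.75, 4.0), (1.75, 4.0), (2.0, 3.0), (2.0, 3.0)]
print(fit_qtl_effects(x, y))
```

In this hypothetical data set the first phenotype shows both additive and dominance effects, while the second is purely additive with the opposite sign.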

5.7. ANALYTIC KIT IMPLEMENTATION
In a preferred embodiment, the methods of this invention can be implemented by
use of kits for determining the responses or state of a biological sample. Such kits
contain microarrays, such as those described in the subsections below. The microarrays
contained in such kits comprise a solid phase, e.g., a surface, to which probes are
hybridized or bound at a known location of the solid phase. Preferably, these probes
consist of nucleic acids of known, different sequence, with each nucleic acid being
capable of hybridizing to an RNA species or to a cDNA species derived therefrom. In a
particular embodiment, the probes contained in the kits of this invention are nucleic acids
capable of hybridizing specifically to nucleic acid sequences derived from RNA species
in cells collected from an organism of interest.

In a preferred embodiment, a kit of the invention also contains one or more
databases described above and in Fig. 1, encoded on computer readable medium, and/or
an access authorization to use the databases described above from a remote networked
computer.

In another preferred embodiment, a kit of the invention further contains software
capable of being loaded into the memory of a computer system such as the one described
supra, and illustrated in Fig. 1. The software contained in the kit of this invention is
essentially identical to the software described above in conjunction with Fig. 1.
Alternative kits for implementing the analytic methods of this invention will be apparent
to one of skill in the art and are intended to be comprehended within the accompanying
claims.

5.8. TRANSCRIPTIONAL STATE MEASUREMENTS
This section provides some exemplary methods for measuring the expression level
of genes, which are one type of cellular constituent. One of skill in the art will appreciate
that this invention is not limited to the following specific methods for measuring the
expression level of genes in each organism in a plurality of organisms.

5.8.1. TRANSCRIPT ASSAY USING MICROARRAYS

The techniques described in this section are particularly useful for the
determination of the expression state or the transcriptional state of a cell or cell type or
any other cell sample by monitoring expression profiles. These techniques include the
provision of polynucleotide probe arrays that may be used to provide simultaneous
determination of the expression levels of a plurality of genes. These techniques further
provide methods for designing and making such polynucleotide probe arrays.

The expression level of a nucleotide sequence in a gene can be measured by any
high throughput technique. However measured, the result is either the absolute or
relative amounts of transcripts or response data, including but not limited to values
representing abundances or abundance ratios. Preferably, measurement of the
expression profile is made by hybridization to transcript arrays, which are described in
this subsection. In one embodiment, "transcript arrays" or "profiling arrays" are used.
Transcript arrays can be employed for analyzing the expression profile in a cell sample
and especially for measuring the expression profile of a cell sample of a particular tissue
type or developmental state or exposed to a drug of interest.

In one embodiment, an expression profile is obtained by hybridizing detectably
labeled polynucleotides representing the nucleotide sequences in mRNA transcripts
present in a cell (e.g., fluorescently labeled cDNA synthesized from total cell mRNA) to a
microarray. A microarray is an array of positionally-addressable binding (e.g.,
hybridization) sites on a support for representing many of the nucleotide sequences in the
genome of a cell or organism, preferably most or almost all of the genes. Each of such
binding sites consists of polynucleotide probes bound to the predetermined region on the
support. Microarrays can be made in a number of ways, of which several are described
herein below. However produced, microarrays share certain characteristics. The arrays
are reproducible, allowing multiple copies of a given array to be produced and easily
compared with each other. Preferably, the microarrays are made from materials that are
stable under binding (e.g., nucleic acid hybridization) conditions. Microarrays are
preferably small, e.g., between about 1 cm² and 25 cm², preferably about 1 to 3 cm².
However, both larger and smaller arrays are also contemplated and may be preferable,
e.g., for simultaneously evaluating a very large number or very small number of different
probes.

Preferably, a given binding site or unique set of binding sites in the microarray
will specifically bind (e.g., hybridize) to a nucleotide sequence in a single gene from a
cell or organism (e.g., to an exon of a specific mRNA or a specific cDNA derived
therefrom).

The microarrays used can include one or more test probes, each of which has a
polynucleotide sequence that is complementary to a subsequence of RNA or DNA to be
detected. Each probe typically has a different nucleic acid sequence, and the position of
each probe on the solid surface of the array is usually known. Indeed, the microarrays are
preferably addressable arrays, more preferably positionally addressable arrays. Each
probe of the array is preferably located at a known, predetermined position on the solid
support so that the identity (i.e., the sequence) of each probe can be determined from its
position on the array (i.e., on the support or surface). In some embodiments, the arrays
are ordered arrays.

Preferably, the density of probes on a microarray or a set of microarrays is about
100 different (i.e., non-identical) probes per 1 cm² or higher. More preferably, a
microarray used in the methods of the invention will have at least 550 probes per 1 cm²,
at least 1,000 probes per 1 cm², at least 1,500 probes per 1 cm² or at least 2,000 probes
per 1 cm². In a particularly preferred embodiment, the microarray is a high density array,
preferably having a density of at least about 2,500 different probes per 1 cm². The
microarrays used in the invention therefore preferably contain at least 2,500, at least
5,000, at least 10,000, at least 15,000, at least 20,000, at least 25,000, at least 50,000 or at
least 55,000 different (i.e., non-identical) probes.

In one embodiment, the microarray is an array (e.g., a matrix) in which each
position represents a discrete binding site for a nucleotide sequence of a transcript
encoded by a gene (e.g., for an exon of an mRNA or a cDNA derived therefrom). The
collection of binding sites on a microarray contains sets of binding sites for a plurality of
genes. For example, in various embodiments, the microarrays of the invention can
comprise binding sites for products encoded by fewer than 50% of the genes in the
genome of an organism. Alternatively, the microarrays of the invention can have binding
sites for the products encoded by at least 50%, at least 75%, at least 85%, at least 90%, at
least 95%, at least 99% or 100% of the genes in the genome of an organism. In other
embodiments, the microarrays of the invention can have binding sites for products
encoded by fewer than 50%, by at least 50%, by at least 75%, by at least 85%, by at least
90%, by at least 95%, by at least 99% or by 100% of the genes expressed by a cell of an
organism. The binding site can be a DNA or DNA analog to which a particular RNA can
specifically hybridize. The DNA or DNA analog can be, e.g., a synthetic oligomer or a
gene fragment, e.g., corresponding to an exon.

In some embodiments of the present invention, a gene or an exon in a gene is
represented in the profiling arrays by a set of binding sites comprising probes with
different polynucleotides that are complementary to different sequence segments of the
gene or the exon. Such polynucleotides are preferably of the length of 15 to 200 bases,
more preferably of the length of 20 to 100 bases, most preferably 40-60 bases. Each
probe sequence may also comprise linker sequences in addition to the sequence that is
complementary to its target sequence. As used herein, a linker sequence is a sequence
between the sequence that is complementary to its target sequence and the surface of
support. For example, in preferred embodiments, the profiling arrays of the invention
comprise one probe specific to each target gene or exon. However, if desired, the
profiling arrays may contain at least 2, 5, 10, 100, or 1000 or more probes specific to
some target genes or exons. For example, the array may contain probes tiled across the
sequence of the longest mRNA isoform of a gene at single base steps.

In specific embodiments of the invention, when an exon has alternative spliced
variants, a set of polynucleotide probes of successive overlapping sequences, i.e., tiled
sequences, across the genomic region containing the longest variant of an exon can be
included in the exon profiling arrays. The set of polynucleotide probes can comprise
successive overlapping sequences at steps of a predetermined base interval, e.g., at steps
of 1, 5, or 10 base intervals, that span, or are tiled across, the mRNA containing the longest
variant. Such sets of probes therefore can be used to scan the genomic region containing
all variants of an exon to determine the expressed variant or variants of the exon.
Alternatively or additionally, a set of polynucleotide probes comprising exon specific
probes and/or variant junction probes can be included in the exon profiling array. As used
herein, a variant junction probe refers to a probe specific to the junction region of the
particular exon variant and the neighboring exon. In some cases, the probe set contains
variant junction probes specifically hybridizable to each of all different splice junction
sequences of the exon. In other cases, the probe set contains exon specific probes
specifically hybridizable to the common sequences in all different variants of the exon,
and/or variant junction probes specifically hybridizable to the different splice junction
sequences of the exon.
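
By way of illustration only, the tiling of successive overlapping sequences at a predetermined base interval can be sketched as follows. The function name tile_probes and its parameters are hypothetical and do not appear in the specification; the sketch returns target-sense segments, and the physical probe for each segment would be its complement.

```python
def tile_probes(sequence, probe_length, step):
    """Return (start, segment) pairs tiled across `sequence`.

    `sequence` is the target-sense sequence (e.g., the mRNA containing the
    longest variant), `probe_length` is the segment length in bases, and
    `step` is the tiling interval in bases (e.g., 1, 5, or 10).
    All names here are illustrative assumptions.
    """
    if probe_length <= 0 or step <= 0:
        raise ValueError("probe_length and step must be positive")
    last_start = len(sequence) - probe_length
    # When the sequence is shorter than one probe, the range is empty.
    return [(start, sequence[start:start + probe_length])
            for start in range(0, last_start + 1, step)]
```

With a step of 1 every base position starts a segment; larger steps trade resolution for a smaller probe set.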

In some cases, an exon is represented in the exon profiling arrays by a probe
comprising a polynucleotide that is complementary to the full length exon. In such
instances, an exon is represented by a single binding site on the profiling arrays. In some
preferred cases, an exon is represented by one or more binding sites on the profiling
arrays, each of the binding sites comprising a probe with a polynucleotide sequence that
is complementary to an RNA fragment that is a substantial portion of the target exon.
The lengths of such probes are normally between about 15-600 bases, preferably between
about 20-200 bases, more preferably between about 30-100 bases, and most preferably
between about 40-80 bases. The average length of an exon is about 200 bases (see, e.g.,
Lewin, Genes V, Oxford University Press, Oxford, 1994). A probe of length of about
40-80 allows more specific binding of the exon than a probe of shorter length, thereby
increasing the specificity of the probe to the target exon. For certain genes, one or more
targeted exons may have sequence lengths less than about 40-80 bases. In such cases, if
probes with sequences longer than the target exons are to be used, it may be desirable to
design probes comprising sequences that include the entire target exon flanked by
sequences from the adjacent constitutively spliced exon or exons such that the probe
sequences are complementary to the corresponding sequence segments in the mRNAs.
Using flanking sequence from adjacent constitutively spliced exon or exons rather than
the genomic flanking sequences, i.e., intron sequences, permits comparable hybridization
stringency with other probes of the same length. Preferably the flanking sequences used
are from the adjacent constitutively spliced exon or exons that are not involved in any
alternative pathways. More preferably the flanking sequences used do not comprise a
significant portion of the sequence of the adjacent exon or exons so that
cross-hybridization can be minimized. In some embodiments, when a target exon that is shorter
than the desired probe length is involved in alternative splicing, probes comprising
flanking sequences in different alternatively spliced mRNAs are designed so that
expression level of the exon expressed in different alternatively spliced mRNAs can be
measured.
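
As a non-limiting sketch of the flanking approach described above, a target exon shorter than the desired probe length can be padded with sequence from the adjacent exons as they occur in the mRNA. The function name pad_short_exon, the even split of the flank between the two sides, and the error handling are assumptions made for illustration.

```python
def pad_short_exon(upstream_exon, target_exon, downstream_exon, probe_length):
    """Return an mRNA-sense segment containing the whole target exon.

    If the exon is shorter than `probe_length`, it is flanked with the 3' end
    of the upstream exon and the 5' end of the downstream exon (both assumed
    to be constitutively spliced). Exons already at least `probe_length`
    long are returned unchanged.
    """
    deficit = probe_length - len(target_exon)
    if deficit <= 0:
        return target_exon
    left = deficit // 2          # bases borrowed from the upstream exon
    right = deficit - left       # bases borrowed from the downstream exon
    if left > len(upstream_exon) or right > len(downstream_exon):
        raise ValueError("adjacent exons are too short to supply the flank")
    left_flank = upstream_exon[len(upstream_exon) - left:]
    right_flank = downstream_exon[:right]
    return left_flank + target_exon + right_flank
```

Because the flank is taken from exon sequence rather than intron sequence, the resulting segment matches a contiguous stretch of the spliced mRNA.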

In some instances, when alternative splicing pathways and/or exon duplication in
separate genes are to be distinguished, the DNA array or set of arrays can also comprise
probes that are complementary to sequences spanning the junction regions of two
adjacent exons. Preferably, such probes comprise sequences from the two exons which
are not substantially overlapped with probes for each individual exon so that cross
hybridization can be minimized. Probes that comprise sequences from more than one
exon are useful in distinguishing alternative splicing pathways and/or expression of
duplicated exons in separate genes if the exons occur in one or more alternative spliced
mRNAs and/or one or more separated genes that contain the duplicated exons but not in
other alternatively spliced mRNAs and/or other genes that contain the duplicated exons.
Alternatively, for duplicate exons in separate genes, if the exons from different genes
show substantial difference in sequence homology, it is preferable to include probes that
are different so that the exons from different genes can be distinguished.
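
The junction-spanning probes described above can be illustrated with the following sketch, which lists the exon-exon junctions present in a set of splice variants and builds the target-sense segment spanning one junction. The names splice_junctions and junction_segment, and the representation of an isoform as an ordered list of exon identifiers, are hypothetical.

```python
def splice_junctions(isoforms):
    """Return the sorted set of adjacent exon pairs found in any isoform.

    Each isoform is an ordered sequence of exon identifiers.
    """
    pairs = set()
    for exons in isoforms:
        pairs.update(zip(exons, exons[1:]))
    return sorted(pairs)


def junction_segment(exon_a, exon_b, probe_length):
    """Return the target-sense segment spanning the exon_a/exon_b junction.

    About half of the segment comes from the 3' end of exon_a and the
    remainder from the 5' end of exon_b; short exons contribute what they have.
    """
    from_a = min(probe_length // 2, len(exon_a))
    from_b = min(probe_length - from_a, len(exon_b))
    return exon_a[len(exon_a) - from_a:] + exon_b[:from_b]
```

A junction that occurs in only one isoform, such as the pair ("E1", "E3") in an exon-skipping variant, identifies that isoform when its junction probe gives signal.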

It will be apparent to one skilled in the art that any of the probe schemes, supra,
can be combined on the same profiling array and/or on different arrays within the same
set of profiling arrays so that a more accurate determination of the expression profile for a
plurality of genes can be accomplished. It will also be apparent to one skilled in the art
that the different probe schemes can also be used for different levels of accuracies in
profiling. For example, a profiling array or array set comprising a small set of probes for
each exon may be used to determine the relevant genes and/or RNA splicing pathways
under certain specific conditions. An array or array set comprising larger sets of probes
for the exons that are of interest is then used to more accurately determine the exon
expression profile under such specific conditions. Other DNA array strategies that allow
more advantageous use of different probe schemes are also encompassed.

Preferably, the microarrays used in the invention have binding sites (i.e., probes)
for sets of exons for one or more genes relevant to the action of a drug of interest or in a
biological pathway of interest. As discussed above, a "gene" is identified as a portion of
DNA that is transcribed by RNA polymerase, which may include a 5' untranslated region
("UTR"), introns, exons and a 3' UTR. The number of genes in a genome can be
estimated from the number of mRNAs expressed by the cell or organism, or by
extrapolation of a well characterized portion of the genome. When the genome of the
organism of interest has been sequenced, the number of ORFs can be determined and
mRNA coding regions identified by analysis of the DNA sequence. For example, the
genome of Saccharomyces cerevisiae has been completely sequenced and is reported to
have approximately 6275 ORFs encoding sequences longer than 99 amino acid residues in
length. Analysis of these ORFs indicates that there are 5,885 ORFs that are likely to
encode protein products (Goffeau et al., 1996, Science 274: 546-567). In contrast, the
human genome is estimated to contain approximately 30,000 to 130,000 genes (see
Crollius et al., 2000, Nature Genetics 25:235-238; Ewing et al., 2000, Nature Genetics
25:232-234). Genome sequences for other organisms, including but not limited to
Drosophila, C. elegans, plants, e.g., rice and Arabidopsis, and mammals, e.g., mouse and
human, are also completed or nearly completed. Thus, in preferred embodiments of the
invention, an array set comprising in total probes for all known or predicted exons in the
genome of an organism is provided. As a non-limiting example, the present invention
provides an array set comprising one or two probes for each known or predicted exon in
the human genome.

It will be appreciated that when cDNA complementary to the RNA of a cell is
made and hybridized to a microarray under suitable hybridization conditions, the level of
hybridization to the site in the array corresponding to an exon of any particular gene will
reflect the prevalence in the cell of mRNA or mRNAs containing the exon transcribed
from that gene. For example, when detectably labeled (e.g., with a fluorophore) cDNA
complementary to the total cellular mRNA is hybridized to a microarray, the site on the
array corresponding to an exon of a gene (i.e., capable of specifically binding the product
or products of the gene expressing) that is not transcribed or is removed during RNA
splicing in the cell will have little or no signal (e.g., fluorescent signal), and an exon of a
gene for which the encoded mRNA expressing the exon is prevalent will have a relatively
strong signal. The relative abundance of different mRNAs produced from the same gene
by alternative splicing is then determined by the signal strength pattern across the whole
set of exons monitored for the gene.

In one embodiment, cDNAs from cell samples from two different conditions are
hybridized to the binding sites of the microarray using a two-color protocol. In the case
of drug responses one cell sample is exposed to a drug and another cell sample of the
same type is not exposed to the drug. In the case of pathway responses one cell is
exposed to a pathway perturbation and another cell of the same type is not exposed to the
pathway perturbation. The cDNAs derived from each of the two cell types are differently
labeled (e.g., with Cy3 and Cy5) so that they can be distinguished. In one embodiment,
for example, cDNA from a cell treated with a drug (or exposed to a pathway perturbation)
is synthesized using a fluorescein-labeled dNTP, and cDNA from a second cell, not
drug-exposed, is synthesized using a rhodamine-labeled dNTP. When the two cDNAs are
mixed and hybridized to the microarray, the relative intensity of signal from each cDNA
set is determined for each site on the array, and any relative difference in abundance of a
particular exon is detected.

In the example described above, the cDNA from the drug-treated (or pathway
perturbed) cell will fluoresce green when the fluorophore is stimulated and the cDNA
from the untreated cell will fluoresce red. As a result, when the drug treatment has no
effect, either directly or indirectly, on the transcription and/or post-transcriptional
splicing of a particular gene in a cell, the exon expression patterns will be
indistinguishable in both cells and, upon reverse transcription, red-labeled and
green-labeled cDNA will be equally prevalent. When hybridized to the microarray, the
binding site(s) for that species of RNA will emit wavelengths characteristic of both
fluorophores. In contrast, when the drug-exposed cell is treated with a drug that, directly
or indirectly, changes the transcription and/or post-transcriptional splicing of a particular
gene in the cell, the exon expression pattern as represented by ratio of green to red
fluorescence for each exon binding site will change. When the drug increases the
prevalence of an mRNA, the ratios for each exon expressed in the mRNA will increase,
whereas when the drug decreases the prevalence of an mRNA, the ratio for each exon
expressed in the mRNA will decrease.
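
The green-to-red ratio described above can be computed per exon binding site as in the following minimal sketch. The function name log2_ratios, the dictionary layout, and the intensity floor are assumptions introduced for illustration; the specification does not prescribe a particular transform, and the base-2 logarithm is used here only so that increases and decreases are symmetric about zero.

```python
import math


def log2_ratios(green, red, floor=1.0):
    """Return log2(green / red) for each exon binding site.

    `green` maps exon identifiers to the signal of the treated sample and
    `red` to the signal of the untreated sample. Intensities below `floor`
    (a hypothetical background level) are raised to it so that the ratio
    is always defined.
    """
    ratios = {}
    for exon, g_signal in green.items():
        g = max(g_signal, floor)
        r = max(red[exon], floor)
        ratios[exon] = math.log2(g / r)
    return ratios
```

A value near zero indicates no change for that exon, a positive value indicates increased prevalence in the treated sample, and a negative value indicates decreased prevalence.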

The use of a two-color fluorescence labeling and detection scheme to define
alterations in gene expression has been described in connection with detection of
mRNAs, e.g., in Schena et al., 1995, Quantitative monitoring of gene expression patterns
with a complementary DNA microarray, Science 270:467-470, which is incorporated by
reference in its entirety for all purposes. The scheme is equally applicable to labeling and
detection of exons. An advantage of using cDNA labeled with two different fluorophores
is that a direct and internally controlled comparison of the mRNA or exon expression
levels corresponding to each arrayed gene in two cell states can be made, and variations
due to minor differences in experimental conditions (e.g., hybridization conditions) will
not affect subsequent analyses. However, it will be recognized that it is also possible to
use cDNA from a single cell, and compare, for example, the absolute amount of a
particular exon in, e.g., a drug-treated or pathway-perturbed cell and an untreated cell.
Furthermore, labeling with more than two colors is also contemplated in the present
invention. In some embodiments of the invention, at least 5, 10, 20, or 100 dyes of
different colors can be used for labeling. Such labeling permits simultaneous hybridizing
of the distinguishably labeled cDNA populations to the same array, and thus measuring,
and optionally comparing the expression levels of, mRNA molecules derived from more
than two samples. Dyes that can be used include, but are not limited to, fluorescein and
its derivatives, rhodamine and its derivatives, texas red, 5'carboxy-fluorescein ("FMA"),
2',7'-dimethoxy-4',5'-dichloro-6-carboxy-fluorescein ("JOE"), N,N,N',N'-tetramethyl-6-
carboxy-rhodamine ("TAMRA"), 6'carboxy-X-rhodamine ("ROX"), HEX, TET, IRD40,
and IRD41; cyanine dyes, including but not limited to Cy3, Cy3.5 and Cy5;
BODIPY dyes including but not limited to BODIPY-FL, BODIPY-TR, BODIPY-TMR,
BODIPY-630/650, and BODIPY-650/670; and ALEXA dyes, including but
not limited to ALEXA-488, ALEXA-532, ALEXA-546, ALEXA-568, and ALEXA-594;
as well as other fluorescent dyes which will be known to those who are skilled in the art.

In some embodiments of the invention, hybridization data are measured at a
plurality of different hybridization times so that the evolution of hybridization levels to
equilibrium can be determined. In such embodiments, hybridization levels are most
preferably measured at hybridization times spanning the range from 0 to in excess of what
is required for sampling of the bound polynucleotides (i.e., the probe or probes) by the
labeled polynucleotides so that the mixture is close to or has substantially reached
equilibrium, and duplexes are at concentrations dependent on affinity and abundance
rather than diffusion. However, the hybridization times are preferably short enough that
irreversible binding interactions between the labeled polynucleotide and the probes and/or
the surface do not occur, or are at least limited. For example, in embodiments wherein
polynucleotide arrays are used to probe a complex mixture of fragmented
polynucleotides, typical hybridization times may be approximately 0-72 hours.
Appropriate hybridization times for other embodiments will depend on the particular
polynucleotide sequences and probes used, and may be determined by those skilled in the
art (see, e.g., Sambrook et al., Eds., 1989, Molecular Cloning: A Laboratory Manual,
2nd ed., Vol. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York).

In one embodiment, hybridization levels at different hybridization times are
measured separately on different, identical microarrays. For each such measurement, at
the hybridization time when the hybridization level is measured, the microarray is washed
briefly, preferably at room temperature in an aqueous solution of high to moderate salt
concentration (e.g., 0.5 to 3 M salt concentration) under conditions which retain all bound
or hybridized polynucleotides while removing all unbound polynucleotides. The
detectable label on the remaining, hybridized polynucleotide molecules on each probe is
then measured by a method which is appropriate to the particular labeling method used.
The resulting hybridization levels are then combined to form a hybridization curve. In
another embodiment, hybridization levels are measured in real time using a single
microarray. In this embodiment, the microarray is allowed to hybridize to the sample
without interruption and the microarray is interrogated at each hybridization time in a
non-invasive manner. In still another embodiment, one can use one array, hybridize for a
short time, wash and measure the hybridization level, put back to the same sample,
hybridize for another period of time, wash and measure again to get the hybridization
time curve.
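
Purely as an illustrative sketch, hybridization levels measured at different times can be combined into a hybridization curve as follows. The function names, the averaging of replicate measurements taken at the same time, and the 5% plateau tolerance are assumptions and are not taken from the specification.

```python
def hybridization_curve(measurements):
    """Combine (time_hours, level) measurements into a time-ordered curve.

    Replicate levels recorded at the same hybridization time (e.g., from
    identical microarrays) are averaged.
    """
    by_time = {}
    for time_hours, level in measurements:
        by_time.setdefault(time_hours, []).append(level)
    return [(t, sum(levels) / len(levels)) for t, levels in sorted(by_time.items())]


def near_equilibrium(curve, tolerance=0.05):
    """Return True if the last two curve points differ by at most `tolerance`.

    The relative-change criterion and its default value are hypothetical.
    """
    if len(curve) < 2:
        return False
    previous, last = curve[-2][1], curve[-1][1]
    if last == 0:
        return previous == 0
    return abs(last - previous) / abs(last) <= tolerance
```

The same curve structure serves whether the levels come from separate identical arrays, from real-time interrogation of one array, or from repeated wash-and-measure cycles.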

Preferably, at least two hybridization levels at two different hybridization times
are measured, a first one at a hybridization time that is close to the time scale of
cross-hybridization equilibrium and a second one measured at a hybridization time that is
longer than the first one. The time scale of cross-hybridization equilibrium depends, inter
alia, on sample composition and probe sequence and may be determined by one skilled in
the art. In preferred embodiments, the first hybridization level is measured at between 1
to 10 hours, whereas the second hybridization time is measured at about 2, 4, 6, 10, 12,
16, 18, 48 or 72 times as long as the first hybridization time.


5.8.1.1. PREPARING PROBES FOR MICROARRAYS
As noted above, the "probe" to which a particular polynucleotide molecule, such
as an exon, specifically hybridizes according to the invention is a complementary
polynucleotide sequence. Preferably one or more probes are selected for each target
exon. For example, when a minimum number of probes are to be used for the detection
of an exon, the probes normally comprise nucleotide sequences greater than about 40
bases in length. Alternatively, when a large set of redundant probes is to be used for an
exon, the probes normally comprise nucleotide sequences of about 40-60 bases. The
probes can also comprise sequences complementary to full length exons. The lengths of
exons can range from less than 50 bases to more than 200 bases. Therefore, when a
probe length longer than exon is to be used, it is preferable to augment the exon sequence
with adjacent constitutively spliced exon sequences such that the probe sequence is
complementary to the continuous mRNA fragment that contains the target exon. This
will allow comparable hybridization stringency among the probes of an exon profiling
array. It will be understood that each probe sequence may also comprise linker sequences
in addition to the sequence that is complementary to its target sequence.
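
Because the probe is the complement of its target, a DNA probe sequence can be derived from a target sequence by reverse complementation, as in the minimal sketch below. The function name probe_for_target is hypothetical, and only the unambiguous bases A, C, G, T and U are handled.

```python
# Maps each target base (DNA or RNA) to the DNA base that pairs with it.
_COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


def probe_for_target(target):
    """Return the DNA probe, written 5' to 3', complementary to `target`.

    `target` is a DNA or RNA sequence written 5' to 3'.
    """
    return target.translate(_COMPLEMENT)[::-1]
```

Any linker sequence would be added to this complementary portion separately.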

The probes may comprise DNA or DNA "mimics" (e.g., derivatives and
analogues) corresponding to a portion of each exon of each gene in an organism's
genome. In one embodiment, the probes of the microarray are complementary RNA or
RNA mimics. DNA mimics are polymers composed of subunits capable of specific,
Watson-Crick-like hybridization with DNA, or of specific hybridization with RNA. The
nucleic acids can be modified at the base moiety, at the sugar moiety, or at the phosphate
backbone. Exemplary DNA mimics include, e.g., phosphorothioates. DNA can be
obtained, e.g., by polymerase chain reaction (PCR) amplification of exon segments from
genomic DNA, cDNA (e.g., by RT-PCR), or cloned sequences. PCR primers are
preferably chosen based on known sequence of the exons or cDNA that result in
amplification of unique fragments (i.e., fragments that do not share more than 10 bases of
contiguous identical sequence with any other fragment on the microarray). Computer
programs that are well known in the art are useful in the design of primers with the
required specificity and optimal amplification properties, such as Oligo version 5.0
(National Biosciences). Typically each probe on the microarray will be between 20 bases
and 600 bases, and usually between 30 and 200 bases in length. PCR methods are well
known in the art, and are described, for example, in Innis et al., eds., 1990, PCR
Protocols: A Guide to Methods and Applications, Academic Press Inc., San Diego, CA.
It will be apparent to one skilled in the art that controlled robotic systems are useful for
isolating and amplifying nucleic acids.
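
The uniqueness criterion stated above (no more than 10 bases of contiguous identical sequence shared with any other fragment) can be checked as sketched below. The function names are hypothetical, the sketch compares sequences on the same strand only, and it is not a substitute for the primer design programs mentioned above.

```python
def longest_shared_run(a, b):
    """Return the length of the longest contiguous sequence common to a and b.

    Uses the standard dynamic-programming recurrence for the longest common
    substring, keeping only the previous row to limit memory use.
    """
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


def is_unique_fragment(candidate, others, max_shared=10):
    """Return True if `candidate` shares at most `max_shared` contiguous
    identical bases with every fragment in `others`."""
    return all(longest_shared_run(candidate, other) <= max_shared for other in others)
```

A complete check would also compare the candidate against the reverse complement of each other fragment.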

An alternative, preferred means for generating the polynucleotide probes of the
microarray is by synthesis of synthetic polynucleotides or oligonucleotides, e.g., using
N-phosphonate or phosphoramidite chemistries (Froehler et al., 1986, Nucleic Acid Res.
14:5399-5407; McBride et al., 1983, Tetrahedron Lett. 24:246-248). Synthetic sequences
are typically between about 15 and about 600 bases in length, more typically between
about 20 and about 100 bases, most preferably between about 40 and about 70 bases in
length. In some embodiments, synthetic nucleic acids include non-natural bases, such as,
but by no means limited to, inosine. As noted above, nucleic acid analogues may be used
as binding sites for hybridization. An example of a suitable nucleic acid analogue is
peptide nucleic acid (see, e.g., Egholm et al., 1993, Nature 365:566-568; U.S. Patent No.
5,539,083).

In alternative embodiments, the hybridization sites (i.e., the probes) are made
from plasmid or phage clones of genes, cDNAs (e.g., expressed sequence tags), or inserts
therefrom (Nguyen et al., 1995, Genomics 29:201-109).

5.8.1.2. ATTACHING NUCLEIC ACIDS TO THE SOLID SURFACE

Preformed polynucleotide probes can be deposited on a support to form the array.
Alternatively, polynucleotide probes can be synthesized directly on the support to form
the array. The probes are attached to a solid support or surface, which may be made, e.g.,
from glass, plastic (e.g., polypropylene, nylon), polyacrylamide, nitrocellulose, gel, or
other porous or nonporous material.

A preferred method for attaching the nucleic acids to a surface is by printing on
glass plates, as is described generally by Schena et al., 1995, Science 270:467-470. This
method is especially useful for preparing microarrays of cDNA (See also, DeRisi et al.,
1996, Nature Genetics 14:457-460; Shalon et al., 1996, Genome Res. 6:639-645; and
Schena et al., 1995, Proc. Natl. Acad. Sci. U.S.A. 93:10539-11286).

A second preferred method for making microarrays is by making high-density
polynucleotide arrays. Techniques are known for producing arrays containing thousands
of oligonucleotides complementary to defined sequences, at defined locations on a
surface using photolithographic techniques for synthesis in situ (see, Fodor et al., 1991,
Science 251:767-773; Pease et al., 1994, Proc. Natl. Acad. Sci. U.S.A. 91:5022-5026;
Lockhart et al., 1996, Nature Biotechnology 14:1675; U.S. Patent Nos. 5,578,832;
5,556,752; and 5,510,270) or other methods for rapid synthesis and deposition of defined
oligonucleotides (Blanchard et al., Biosensors & Bioelectronics 11:687-690). When
these methods are used, oligonucleotides (e.g., 60-mers) of known sequence are
synthesized directly on a surface such as a derivatized glass slide. The array produced
can be redundant, with several polynucleotide molecules per exon.

Other methods for making microarrays, e.g., by masking (Maskos and Southern,
1992, Nucl. Acids. Res. 20:1679-1684), may also be used. In principle, and as noted
supra, any type of array, for example, dot blots on a nylon hybridization membrane (see
Sambrook et al., supra) could be used. However, as will be recognized by those skilled
in the art, very small arrays will frequently be preferred because hybridization volumes
will be smaller.

In a particularly preferred embodiment, microarrays of the invention are
manufactured by means of an ink jet printing device for oligonucleotide synthesis, e.g.,
using the methods and systems described by Blanchard in International Patent Publication
No. WO 98/41531, published September 24, 1998; Blanchard et al., 1996, Biosensors
and Bioelectronics 11:687-690; Blanchard, 1998, in Synthetic DNA Arrays in Genetic
Engineering, Vol. 20, J.K. Setlow, Ed., Plenum Press, New York at pages 111-123; and
U.S. Patent No. 6,028,189 to Blanchard. Specifically, the polynucleotide probes in such
microarrays are preferably synthesized in arrays, e.g., on a glass slide, by serially
depositing individual nucleotide bases in "microdroplets" of a high surface tension
solvent such as propylene carbonate. The microdroplets have small volumes (e.g.,
100 pL or less, more preferably 50 pL or less) and are separated from each other on the
microarray (e.g., by hydrophobic domains) to form circular surface tension wells which
define the locations of the array elements (i.e., the different probes). Polynucleotide
probes are normally attached to the surface covalently at the 3' end of the polynucleotide.
Alternatively, polynucleotide probes can be attached to the surface covalently at the 5'
end of the polynucleotide (see for example, Blanchard, 1998, in Synthetic DNA Arrays
in Genetic Engineering, Vol. 20, J.K. Setlow, Ed., Plenum Press, New York at pages
111-123).




5.8.1.3. TARGET POLYNUCLEOTIDE MOLECULES
Target polynucleotides which may be analyzed by the methods and compositions
of the invention include RNA molecules such as, but by no means limited to, messenger
RNA (mRNA) molecules, ribosomal RNA (rRNA) molecules, cRNA molecules (i.e.,
RNA molecules prepared from cDNA molecules that are transcribed in vivo) and
fragments thereof. Target polynucleotides which may also be analyzed by the methods
and compositions of the present invention include, but are not limited to DNA molecules
such as genomic DNA molecules, cDNA molecules, and fragments thereof including
oligonucleotides, ESTs, STSs, etc.

The target polynucleotides may be from any source. For example, the target
polynucleotide molecules may be naturally occurring nucleic acid molecules such as
genomic or extragenomic DNA molecules isolated from an organism, or RNA molecules,
such as mRNA molecules, isolated from an organism. Alternatively, the polynucleotide
molecules may be synthesized, including, e.g., nucleic acid molecules synthesized
enzymatically in vivo or in vitro, such as cDNA molecules, or polynucleotide molecules
synthesized by PCR, RNA molecules synthesized by in vitro transcription, etc. The
sample of target polynucleotides can comprise, e.g., molecules of DNA, RNA, or
copolymers of DNA and RNA. In preferred embodiments, the target polynucleotides of
the invention will correspond to particular genes or to particular gene transcripts (e.g., to
particular mRNA sequences expressed in cells or to particular cDNA sequences derived
from such mRNA sequences). However, in many embodiments, particularly those
embodiments wherein the polynucleotide molecules are derived from mammalian cells,
the target polynucleotides may correspond to particular fragments of a gene transcript.
For example, the target polynucleotides may correspond to different exons of the same
gene, e.g., so that different splice variants of that gene may be detected and/or analyzed.

In preferred embodiments, the target polynucleotides to be analyzed are prepared
in vitro from nucleic acids extracted from cells. For example, in one embodiment, RNA
is extracted from cells (e.g., total cellular RNA, poly(A)+ messenger RNA, fraction
thereof) and messenger RNA is purified from the total extracted RNA. Methods for
preparing total and poly(A)+ RNA are well known in the art, and are described generally,
e.g., in Sambrook et al., supra. In one embodiment, RNA is extracted from cells of the
various types of interest in this invention using guanidinium thiocyanate lysis followed by
CsCl centrifugation and an oligo dT purification (Chirgwin et al., 1979, Biochemistry
18:5294-5299). In another embodiment, RNA is extracted from cells using guanidinium
thiocyanate lysis followed by purification on RNeasy columns (Qiagen). cDNA is then
synthesized from the purified mRNA using, e.g., oligo-dT or random primers. In
preferred embodiments, the target polynucleotides are cRNA prepared from purified
messenger RNA extracted from cells. As used herein, cRNA is defined here as RNA
complementary to the source RNA. The extracted RNAs are amplified using a process in
which double-stranded cDNAs are synthesized from the RNAs using a primer linked to
an RNA polymerase promoter in a direction capable of directing transcription of
anti-sense RNA. Anti-sense RNAs or cRNAs are then transcribed from the second strand of
the double-stranded cDNAs using an RNA polymerase (see, e.g., U.S. Patent Nos.
5,891,636, 5,716,785; 5,545,522 and 6,132,997; see also, U.S. Patent No. 6,271,002, and
U.S. Provisional Patent Application Serial No. 60/253,641, filed on November 28, 2000,
by Ziman et al.). Both oligo-dT primers (U.S. Patent Nos. 5,545,522 and 6,132,997) or
random primers (U.S. Provisional Patent Application Serial No. 60/253,641, filed on
November 28, 2000, by Ziman et al.) that contain an RNA polymerase promoter or
[...]
population of the cell. 

The target polynucleotides to be analyzed by the methods and compositions of the
invention are preferably detectably labeled. For example, cDNA can be labeled directly,
e.g., with nucleotide analogs, or indirectly, e.g., by making a second, labeled cDNA
strand using the first strand as a template. Alternatively, the double-stranded cDNA can
be transcribed into cRNA and labeled.

Preferably, the detectable label is a fluorescent label, e.g., by incorporation of
nucleotide analogs. Other labels suitable for use in the present invention include, but are
not limited to, biotin, iminobiotin, antigens, cofactors, dinitrophenol, lipoic acid,
olefinic compounds, detectable polypeptides, electron rich molecules, enzymes capable of
generating a detectable signal by action upon a substrate, and radioactive isotopes.
Preferred radioactive isotopes include ³²P, ³⁵S, ¹⁴C, and ¹²⁵I. Fluorescent molecules
suitable for the present invention include, but are not limited to, fluorescein and its
derivatives, rhodamine and its derivatives, texas red, 5'carboxy-fluorescein ("FMA"),
2',7'-dimethoxy-4',5'-dichloro-6-carboxy-fluorescein ("JOE"), N,N,N',N'-tetramethyl-
6-carboxy-rhodamine ("TAMRA"), 6'carboxy-X-rhodamine ("ROX"), HEX, TET,
IRD40, and IRD41. Fluorescent molecules that are suitable for the invention further
include: cyanine dyes, including but not limited to Cy3, Cy3.5 and Cy5; BODIPY dyes
including but not limited to BODIPY-FL, BODIPY-TR, BODIPY-TMR,
BODIPY-630/650, and BODIPY-650/670; and ALEXA dyes, including but not limited to
ALEXA-488, ALEXA-532, ALEXA-546, ALEXA-568, and ALEXA-594; as well as
other fluorescent dyes which will be known to those who are skilled in the art. Electron
rich indicator molecules suitable for the present invention include, but are not limited to,
ferritin, hemocyanin, and colloidal gold. Alternatively, in less preferred embodiments the
target polynucleotides may be labeled by specifically complexing a first group to the
polynucleotide. A second group, covalently linked to an indicator molecule and which
has an affinity for the first group, can be used to indirectly detect the target
polynucleotide. In such an embodiment, compounds suitable for use as a first group
include, but are not limited to, biotin and iminobiotin. Compounds suitable for use as a
second group include, but are not limited to, avidin and streptavidin.



5.8.1.4. HYBRIDIZATION TO MICROARRAYS 

As described supra, nucleic acid hybridization and wash conditions are chosen so
that the polynucleotide molecules to be analyzed by the invention (referred to herein as
the "target polynucleotide molecules") specifically bind or specifically hybridize to the
complementary polynucleotide sequences of the array, preferably to a specific array site,
wherein its complementary DNA is located.


Arrays containing double-stranded probe DNA situated thereon are preferably
subjected to denaturing conditions to render the DNA single-stranded prior to contacting
with the target polynucleotide molecules. Arrays containing single-stranded probe DNA
(e.g., synthetic oligodeoxyribonucleic acids) may need to be denatured prior to contacting
with the target polynucleotide molecules, e.g., to remove hairpins or dimers which form
due to self complementary sequences.

Optimal hybridization conditions will depend on the length (e.g., oligomer versus
polynucleotide greater than 200 bases) and type (e.g., RNA, or DNA) of probe and target
nucleic acids. General parameters for specific (i.e., stringent) hybridization conditions for
nucleic acids are described in Sambrook et al. (supra), and in Ausubel et al., 1987,
Current Protocols in Molecular Biology, Greene Publishing and Wiley-Interscience, New
York. When the cDNA microarrays of Schena et al. are used, typical hybridization
conditions are hybridization in 5 X SSC plus 0.2% SDS at 65 °C for four hours, followed
by washes at 25 °C in low stringency wash buffer (1 X SSC plus 0.2% SDS), followed by


10 minutes at 25 °C in higher stringency wash buffer (0.1 X SSC plus 0.2% SDS) (Schena
et al., 1996, Proc. Natl. Acad. Sci. U.S.A. 93:10614). Useful hybridization conditions are
also provided in, e.g., Tijessen, 1993, Hybridization With Nucleic Acid Probes, Elsevier
Science Publishers B.V. and Kricka, 1992, Nonisotopic DNA Probe Techniques,
Academic Press, San Diego, CA.

Particularly preferred hybridization conditions for use with the screening and/or
signaling chips of the present invention include hybridization at a temperature at or near
the mean melting temperature of the probes (e.g., within 5 °C, more preferably within
2 °C) in 1 M NaCl, 50 mM MES buffer (pH 6.5), 0.5% sodium Sarcosine and 30%
formamide.


5.8.1.5. SIGNAL DETECTION AND DATA ANALYSIS 

It will be appreciated that when target sequences, e.g., cDNA or cRNA,
complementary to the RNA of a cell are made and hybridized to a microarray under
suitable hybridization conditions, the level of hybridization to the site in the array
corresponding to an exon of any particular gene will reflect the prevalence in the cell of
mRNA or mRNAs containing the exon transcribed from that gene. For example, when
detectably labeled (e.g., with a fluorophore) cDNA complementary to the total cellular
mRNA is hybridized to a microarray, the site on the array corresponding to an exon of a
gene (i.e., capable of specifically binding the product or products of the gene expressing)
that is not transcribed or is removed during RNA splicing in the cell will have little or no
signal (e.g., fluorescent signal), and an exon of a gene for which the encoded mRNA
expressing the exon is prevalent will have a relatively strong signal. The relative
abundance of different mRNAs produced from the same gene by alternative splicing is
then determined by the signal strength pattern across the whole set of exons monitored for
the gene.

In preferred embodiments, target sequences, e.g., cDNAs or cRNAs, from two
different cells are hybridized to the binding sites of the microarray. In the case of drug
responses one cell sample is exposed to a drug and another cell sample of the same type is
not exposed to the drug. In the case of pathway responses one cell is exposed to a
pathway perturbation and another cell of the same type is not exposed to the pathway
perturbation. The cDNA or cRNA derived from each of the two cell types are differently
labeled so that they can be distinguished. In one embodiment, for example, cDNA from a




cell treated with a drug (or exposed to a pathway perturbation) is synthesized using a
fluorescein-labeled dNTP, and cDNA from a second cell, not drug-exposed, is
synthesized using a rhodamine-labeled dNTP. When the two cDNAs are mixed and
hybridized to the microarray, the relative intensity of signal from each cDNA set is
determined for each site on the array, and any relative difference in abundance of a
particular exon is detected.

In the example described above, the cDNA from the drug-treated (or pathway
perturbed) cell will fluoresce green when the fluorophore is stimulated and the cDNA
from the untreated cell will fluoresce red. As a result, when the drug treatment has no
effect, either directly or indirectly, on the transcription and/or post-transcriptional splicing
of a particular gene in a cell, the exon expression patterns will be indistinguishable in
both cells and, upon reverse transcription, red-labeled and green-labeled cDNA will be
equally prevalent. When hybridized to the microarray, the binding site(s) for that species
of RNA will emit wavelengths characteristic of both fluorophores. In contrast, when the
drug-exposed cell is treated with a drug that, directly or indirectly, changes the
transcription and/or post-transcriptional splicing of a particular gene in the cell, the exon
expression pattern as represented by the ratio of green to red fluorescence for each exon
binding site will change. When the drug increases the prevalence of an mRNA, the ratios
for each exon expressed in the mRNA will increase, whereas when the drug decreases the
prevalence of an mRNA, the ratio for each exon expressed in the mRNA will decrease.

The use of a two-color fluorescence labeling and detection scheme to define
alterations in gene expression has been described in connection with detection of
mRNAs in Schena et al., 1995, Quantitative monitoring of gene expression patterns
with a complementary DNA microarray, Science 270:467-470, which is incorporated by
reference in its entirety for all purposes. The scheme is equally applicable to labeling and
detection of exons. An advantage of using target sequences, e.g., cDNAs or cRNAs,
labeled with two different fluorophores is that a direct and internally controlled
comparison of the mRNA or exon expression levels corresponding to each arrayed gene
in two cell states can be made, and variations due to minor differences in experimental
conditions (e.g., hybridization conditions) will not affect subsequent analyses. However,
it will be recognized that it is also possible to use cDNA from a single cell, and compare,
for example, the absolute amount of a particular exon in, e.g., a drug-treated or
pathway-perturbed cell and an untreated cell.




When fluorescently labeled probes are used, the fluorescence emissions at each
site of a transcript array can be, preferably, detected by scanning confocal laser
microscopy. In one embodiment, a separate scan, using the appropriate excitation line, is
carried out for each of the two fluorophores used. Alternatively, a laser can be used that
allows simultaneous specimen illumination at wavelengths specific to the two
fluorophores and emissions from the two fluorophores can be analyzed simultaneously
(see Shalon et al., 1996, Genome Res. 6:639-645). In a preferred embodiment, the arrays
are scanned with a laser fluorescence scanner with a computer controlled X-Y stage and a
microscope objective. Sequential excitation of the two fluorophores is achieved with a
multi-line, mixed gas laser, and the emitted light is split by wavelength and detected with
two photomultiplier tubes. Such fluorescence laser scanning devices are described, e.g.,
in Schena et al., 1996, Genome Res. 5:639-645. Alternatively, the fiber-optic bundle
described by Ferguson et al., 1996, Nature Biotech. 14:1681-1684, may be used to
monitor mRNA abundance levels at a large number of sites simultaneously.

Signals are recorded and, in a preferred embodiment, analyzed by computer
using a 12 bit analog to digital board. In one embodiment, the scanned image is
despeckled using a graphics program (e.g., Hijaak Graphics Suite) and then analyzed
using an image gridding program that creates a spreadsheet of the average hybridization
at each wavelength at each site. If necessary, an experimentally determined correction for
"cross talk" (or overlap) between the channels for the two fluors may be made. For any
particular hybridization site on the transcript array, a ratio of the emission of the two
fluorophores can be calculated. The ratio is independent of the absolute expression level
of the cognate gene, but is useful for genes whose expression is significantly modulated
by drug administration, gene deletion, or any other tested event.
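
By way of illustration only, the per-site ratio calculation described above may be sketched in Python as follows. The function name, the constant background subtraction, the floor value, and the sample intensities are hypothetical and are not taken from this specification; the base-2 logarithm is a common convenience and is likewise an assumption.

```python
import math

def emission_ratios(green, red, background=0.0, floor=1.0):
    """Return the green/red emission ratio for each array site.

    `background` and `floor` are hypothetical parameters: a constant
    background is subtracted from both channels, and each adjusted
    intensity is clamped to `floor` so the ratio is always defined.
    """
    ratios = []
    for g, r in zip(green, red):
        g_adj = max(g - background, floor)
        r_adj = max(r - background, floor)
        ratios.append(g_adj / r_adj)
    return ratios

# Hypothetical intensities for three sites (arbitrary units).
green = [1200.0, 400.0, 800.0]   # e.g., fluorescein channel (drug-treated cell)
red = [600.0, 400.0, 1600.0]     # e.g., rhodamine channel (untreated cell)

ratios = emission_ratios(green, red)
log_ratios = [math.log2(x) for x in ratios]
print(ratios)      # [2.0, 1.0, 0.5]
print(log_ratios)  # [1.0, 0.0, -1.0]
```

In this sketch a ratio above 1 (positive log ratio) indicates greater signal from the green-labeled sample, and a ratio below 1 indicates greater signal from the red-labeled sample.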

According to the method of the invention, the relative abundance of an mRNA
and/or an exon expressed in an mRNA in two cells or cell lines is scored as perturbed
(i.e., the abundance is different in the two sources of mRNA tested) or as not perturbed
(i.e., the relative abundance is the same). As used herein, a difference between the two
sources of RNA of at least a factor of about 25% (i.e., RNA is 25% more abundant in one
source than in the other source), more usually about 50%, even more often by a factor of
about 2 (i.e., twice as abundant), 3 (three times as abundant), or 5 (five times as abundant)
is scored as a perturbation. Present detection methods allow reliable detection of
differences of an order of about 1.5-fold to about 3-fold.
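
The scoring rule described above can be sketched, under stated assumptions, as a simple threshold on the fold difference between the two sources. The function names and sample abundances below are hypothetical; the threshold values 1.25, 1.5, 2, 3 and 5 correspond to the factors recited in the preceding paragraph.

```python
def fold_difference(abundance_a, abundance_b):
    """Return how many times more abundant the larger source is (always >= 1)."""
    if abundance_a <= 0 or abundance_b <= 0:
        raise ValueError("abundances must be positive")
    return max(abundance_a, abundance_b) / min(abundance_a, abundance_b)

def is_perturbed(abundance_a, abundance_b, min_fold=1.25):
    """Score a pair of abundances as perturbed when the fold difference
    meets the chosen threshold (1.25 corresponds to 'about 25%')."""
    return fold_difference(abundance_a, abundance_b) >= min_fold

# Hypothetical abundances (arbitrary units).
print(is_perturbed(100.0, 125.0))              # True: 25% more abundant
print(is_perturbed(100.0, 110.0))              # False: only 10% more abundant
print(is_perturbed(300.0, 100.0, min_fold=3))  # True: three times as abundant
```

Taking the larger value over the smaller makes the rule symmetric, so that an increase and a decrease of the same magnitude are scored alike.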




It is, however, also advantageous to determine the magnitude of the relative
difference in abundances for an mRNA and/or an exon expressed in an mRNA in two
cells or in two cell lines. This can be carried out, as noted above, by calculating the ratio
of the emission of the two fluorophores used for differential labeling, or by analogous
methods that will be readily apparent to those of skill in the art.



5.8.2. OTHER METHODS OF TRANSCRIPTIONAL STATE MEASUREMENT

The transcriptional state of a cell may be measured by other gene expression
technologies known in the art. Several such technologies produce pools of restriction
fragments of limited complexity for electrophoretic analysis, such as methods combining
double restriction enzyme digestion with phasing primers (see, e.g., European Patent 0
534858 A1, filed September 24, 1992, by Zabeau et al.), or methods selecting restriction
fragments with sites closest to a defined mRNA end (see, e.g., Prashar et al., 1996, Proc.
Natl. Acad. Sci. USA 93:659-663). Other methods statistically sample cDNA pools, such
as by sequencing sufficient bases (e.g., 20-50 bases) in each of multiple cDNAs to
identify each cDNA, or by sequencing short tags (e.g., 9-10 bases) that are generated at
known positions relative to a defined mRNA end (see, e.g., Velculescu, 1995, Science
270:484-487).



5.9. MEASUREMENT OF OTHER ASPECTS OF THE BIOLOGICAL STATE

In various embodiments of the present invention, aspects of the biological state
other than the transcriptional state, such as the translational state, the activity state, or
mixed aspects can be measured. Thus, in such embodiments, gene expression data 44
(Fig. 1) may include translational state measurements or even protein expression
measurements. In fact, in some embodiments, rather than using gene expression
interaction maps based on gene expression, protein expression interaction maps based on
protein expression maps are used. Details of embodiments in which aspects of the
biological state other than the transcriptional state are measured are described in this section.



5.10. TRANSLATIONAL STATE MEASUREMENTS

Measurement of the translational state may be performed according to several
methods. For example, whole genome monitoring of protein (i.e., the "proteome,"
Goffeau et al., supra) can be carried out by constructing a microarray in which binding


sites comprise immobilized, preferably monoclonal, antibodies specific to a plurality of
protein species encoded by the cell genome. Preferably, antibodies are present for a
substantial fraction of the encoded proteins, or at least for those proteins relevant to the
action of a drug of interest. Methods for making monoclonal antibodies are well known
(see, e.g., Harlow and Lane, 1988, Antibodies: A Laboratory Manual, Cold Spring
Harbor, New York, which is incorporated in its entirety for all purposes). In a preferred
embodiment, monoclonal antibodies are raised against synthetic peptide fragments
designed based on genomic sequence of the cell. With such an antibody array, proteins
from the cell are contacted to the array and their binding is assayed with assays known in
the art.

Alternatively, proteins can be separated by two-dimensional gel electrophoresis
systems. Two-dimensional gel electrophoresis is well-known in the art and typically
involves iso-electric focusing along a first dimension followed by SDS-PAGE
electrophoresis along a second dimension. See, e.g., Hames et al., 1990, Gel
Electrophoresis of Proteins: A Practical Approach, IRL Press, New York; Shevchenko et
al., 1996, Proc. Natl. Acad. Sci. USA 93:1440-1445; Sagliocco et al., 1996, Yeast
12:1519-1533; Lander, 1996, Science 274:536-539. The resulting electropherograms can
be analyzed by numerous techniques, including mass spectrometric techniques, Western
blotting and immunoblot analysis using polyclonal and monoclonal antibodies, and
internal and N-terminal micro-sequencing. Using these techniques, it is possible to
identify a substantial fraction of all the proteins produced under given physiological
conditions, including in cells (e.g., in yeast) exposed to a drug, or in cells modified by,
e.g., deletion or over-expression of a specific gene.

5.11. MEASURING OTHER ASPECTS OF THE BIOLOGICAL STATE

Even though methods of this invention are illustrated by embodiments involving
gene expression or translation, the methods of the invention are applicable to any cellular
constituent that can be monitored. For example, where activities of proteins can be
measured, embodiments of this invention can use such measurements. Activity
measurements can be performed by any functional, biochemical, or physical means
appropriate to the particular activity being characterized. Where the activity involves a
chemical transformation, the cellular protein can be contacted with the natural
substrate(s), and the rate of transformation measured. Where the activity involves


association in multimeric units, for example association of an activated DNA binding
complex with DNA, the amount of associated protein or secondary consequences of the
association, such as amounts of mRNA transcribed, can be measured. Also, where only a
functional activity is known, for example, as in cell cycle control, performance of the
function can be observed. However known and measured, the changes in protein
activities form the response data analyzed by the foregoing methods of this invention.


In some embodiments of the present invention, cellular constituent measurements
are derived from cellular phenotypic techniques. One such cellular phenotypic technique
uses cell respiration as a universal reporter. In one embodiment, 96-well microtiter
plates, in which each well contains its own unique chemistry, are provided. Each unique
chemistry is designed to test a particular phenotype. Cells from the organism 46 (Fig. 1)
of interest are pipetted into each well. If the cells exhibit the appropriate phenotype, they
will respire and actively reduce a tetrazolium dye, forming a strong purple color. A weak
phenotype results in a lighter color. No color means that the cells don't have the specific
phenotype. Color changes may be recorded as often as several times each hour. During
one incubation, more than 5,000 phenotypes can be tested. See, for example, Bochner et
al., 2001, Genome Research 11, 1246-55.

In some embodiments of the present invention, the cellular constituents that are
measured (gene expression data 44) are metabolites. Metabolites include, but are not
limited to, amino acids, metals, soluble sugars, sugar phosphates, and complex
carbohydrates. Such metabolites may be measured, for example, at the whole-cell level
using methods such as pyrolysis mass spectrometry (Irwin, 1982, Analytical Pyrolysis: A




Comprehensive Guide, Marcel Dekker, New York; Meuzelaar et al., 1982, Pyrolysis
Mass Spectrometry of Recent and Fossil Biomaterials, Elsevier, Amsterdam),
fourier-transform infrared spectrometry (Griffiths and de Haseth, 1986, Fourier transform
infrared spectrometry, John Wiley, New York; Helm et al., 1991, J. Gen. Microbiol. 137,
69-79; Naumann et al., 1991, Nature 351, 81-82; Naumann et al., 1991, In: Modern
techniques for rapid microbiological analysis, 43-96, Nelson, W.H., ed., VCH Publishers,
New York), Raman spectrometry, gas chromatography-mass spectroscopy (GC-MS)
(Fiehn et al., 2000, Nature Biotechnology 18, 1157-1161), capillary electrophoresis
(CE)/MS, high pressure liquid chromatography / mass spectroscopy (HPLC/MS), as well
as liquid chromatography (LC)-Electrospray and cap-LC-tandem-electrospray mass
spectrometries. Such methods can be combined with established chemometric methods
that make use of artificial neural networks and genetic programming in order to
discriminate between closely related samples.

5.12. EXEMPLARY DISEASES

As discussed supra, the present invention provides an apparatus and method for
associating a gene with a trait exhibited by one or more organisms in a plurality of
organisms of a single species. In some instances, the gene is associated with the trait by
identifying a biological pathway in which the gene product participates. In some
embodiments of the present invention, the trait of interest is a complex trait, such as a
disease, e.g., a human disease. Exemplary diseases include asthma, ataxia telangiectasia
(Jaspers and Bootsma, 1982, Proc. Natl. Acad. Sci. USA 79: 2641), bipolar disorder,
common cancers, common late-onset Alzheimer's disease, diabetes, heart disease,
hereditary early-onset Alzheimer's disease (George-Hyslop et al., 1990, Nature 347:
194), hereditary nonpolyposis colon cancer, hypertension, infection, maturity-onset
diabetes of the young (Barbosa et al., 1976, Diabete Metab. 2: 160), mellitus, migraine,
nonalcoholic fatty liver (NAFL) (Younossi, et al., 2002, Hepatology 35, 746-752),
nonalcoholic steatohepatitis (NASH) (James & Day, 1998, J. Hepatol. 29: 495-501),
non-insulin-dependent diabetes mellitus, obesity, polycystic kidney disease (Reeders et
al., 1987, Human Genetics 76: 348), psoriasis, schizophrenia, steatohepatitis and
xeroderma pigmentosum (De Weerd-Kastelein, Nat. New Biol. 238: 80). Genetic
heterogeneity hampers genetic mapping, because a chromosomal region may cosegregate
with a disease in some families but not in others.




5.13. LINKAGE ANALYSIS 
This section describes a number of standard quantitative trait locus (QTL) linkage
analysis algorithms that can be used in various embodiments of processing step 210 (Fig.
2) and/or processing step 1910 (Fig. 19). Such linkage analysis is also sometimes
referred to as QTL analysis. See, for example, Lynch and Walsh, 1998, Genetics and
Analysis of Quantitative Traits, Sinauer Associates, Sunderland, MA. The primary aim of
linkage analysis is to determine whether there exist pieces of the genome that are passed
down through each of several families with multiple afflicted organisms in a pattern that
is consistent with a particular inheritance model and that is unlikely to occur by chance
alone. In other words, the purpose of these algorithms is to identify a locus (e.g., a QTL)
for a phenotypic trait exhibited by one or more organisms 46. A QTL is a region of a
genome of a species that is responsible for a percentage of variation in a phenotypic trait
in the species under study.

The recombination fraction can be denoted by θ and is bounded between 0 and
0.5. If θ = 0.5 for two loci, then alleles at the two loci are transmitted independently,
half of the gametes being recombinant for the two loci, and half parental. In this case,
the loci are unlinked. If θ < 0.5, then alleles are not transmitted independently, and the
two loci are linked. The extreme scenario is when θ = 0, so that the two loci are
completely linked, and there will be no recombination between the two loci during
meiosis, i.e., all gametes are parental. Linkage analysis tests whether a marker locus, of
known location, is linked to a locus of unknown location, that influences the phenotype
under study. In other words, a QTL is identified by comparing genotypes of organisms in
a group to a phenotype exhibited by the group using pedigree data. The genotype of each
organism at each marker in a plurality of markers in a genetic map produced by marker
genotypic data is compared to a given phenotype of each organism. The genetic map is
created by placing genetic markers in genetic (linear) map order so that the positional
relationships between markers are understood. The information gained from knowing the
relationships between markers that is provided by a marker map provides the setting for
addressing the relationship between QTL effect and QTL location.
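
As a minimal sketch of the recombination fraction discussed above, the following Python fragment estimates the fraction as the proportion of recombinant gametes among all scored gametes and labels the result using the definitions given in the text. The function names and gamete counts are hypothetical, and the labeling applies the definitions literally; it is not a statistical test for linkage.

```python
def recombination_fraction(n_recombinant, n_parental):
    """Estimate the recombination fraction as recombinant / total gametes,
    capped at 0.5 because the fraction is bounded between 0 and 0.5."""
    total = n_recombinant + n_parental
    if total <= 0:
        raise ValueError("at least one gamete must be scored")
    return min(n_recombinant / total, 0.5)

def describe_linkage(theta):
    """Label a recombination fraction according to the definitions above."""
    if theta == 0.0:
        return "completely linked"
    if theta < 0.5:
        return "linked"
    return "unlinked"

# Hypothetical gamete counts for three pairs of loci.
for rec, par in [(0, 100), (10, 90), (50, 50)]:
    theta = recombination_fraction(rec, par)
    print(rec, par, theta, describe_linkage(theta))
```

With real data the estimate is subject to sampling error, so a lod score or comparable test would ordinarily be used to decide whether an estimate below 0.5 reflects linkage.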

In some embodiments of the present invention, linkage analysis is based on any of
the QTL detection methods disclosed or referenced in Lynch and Walsh, 1998, Genetics
and Analysis of Quantitative Traits, Sinauer Associates, Inc., Sunderland, MA.


5.13.1. PHENOTYPIC DATA USED 
It will be appreciated that the present invention provides no limitation on the type
of phenotypic data that can be used to perform QTL analysis. The phenotypic data can,
for example, represent a series of measurements for a quantifiable phenotypic trait in a
collection of organisms. Such quantifiable phenotypic traits can include, for example, tail
length, life span, eye color, size and weight. Alternatively, the phenotypic data can be in
a binary form that tracks the absence or presence of some phenotypic trait. As an
example, a "1" can indicate that a particular species of the organism of interest possesses
a given phenotypic trait and a "0" can indicate that a particular species of the organism of
interest lacks the phenotypic trait. The phenotypic trait can be any form of biological data
that is representative of the phenotype of each organism in the population under study. In
some embodiments, the phenotypic traits are quantified and are often referred to as
quantitative phenotypes.
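
The two forms of phenotypic data described above may be represented, purely for illustration, as follows. The organism identifiers, trait values, and function name are hypothetical.

```python
def encode_binary_trait(observations):
    """Map presence/absence observations to the 1/0 coding described above."""
    return {organism: 1 if present else 0
            for organism, present in observations.items()}

# Quantitative form: one measurement per organism (e.g., tail length in cm).
tail_length = {"org_1": 8.2, "org_2": 7.5, "org_3": 9.1}

# Binary form: 1 if the organism exhibits the trait, 0 if it does not.
has_trait = encode_binary_trait({"org_1": True, "org_2": False, "org_3": True})
print(has_trait)  # {'org_1': 1, 'org_2': 0, 'org_3': 1}
```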


5.13.2. GENOTYPIC DATA USED 

In order to provide the necessary genotypic data for linkage analysis, the genotype
of each marker in the genetic marker map is determined for each organism in a population
under study. In essence, the genotypic information comprises information about
polymorphism at each marker location in the genome of the population under study.
Representative forms of polymorphisms used to construct genotypic information include,
but are not limited to, single nucleotide polymorphisms, microsatellite markers,
restriction fragment length polymorphisms, short tandem repeats, sequence length
polymorphisms, and DNA methylation patterns.

Linkage analyses use the genetic map derived from marker genotypic data as the
framework for location of QTL for any given quantitative trait. In some embodiments,
the intervals that are defined by ordered pairs of markers are searched in increments (for
example, 2 cM), and statistical methods are used to test whether a QTL is likely to be
present at the location within the interval. In one embodiment, linkage analysis
statistically tests for a single QTL at each increment across the ordered markers in a
genetic map. The results of the tests are expressed as lod scores, which compare the
evaluation of the likelihood function under a null hypothesis (no QTL) with the
alternative hypothesis (QTL at the testing position) for the purpose of locating probable




QTL. More details on lod scores are found in Section 5.4, as well as in Lander and
Schork, 1994, Science 265, p. 2037-2048. Interval mapping searches through the ordered
genetic markers in a systematic, linear (one-dimensional) fashion, testing the same null
hypothesis and using the same form of likelihood at each increment.
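
The comparison of a null hypothesis (no QTL) with an alternative hypothesis (QTL at the tested position) can be sketched as follows. This is a simplified single-marker version under a normal-error model, in which the lod score reduces to (n/2)·log10(RSS0/RSS1), where RSS0 is the residual sum of squares about the overall mean and RSS1 is the residual sum of squares about the genotype-class means. It tests only at marker positions and is not the interval mapping procedure itself; the function names, marker names, and data are hypothetical.

```python
import math

def lod_single_marker(genotypes, phenotypes):
    """lod score comparing 'no QTL' with 'QTL at this marker' under a
    normal-error model: (n / 2) * log10(RSS0 / RSS1)."""
    n = len(phenotypes)
    mean_all = sum(phenotypes) / n
    rss0 = sum((y - mean_all) ** 2 for y in phenotypes)  # null: one mean

    groups = {}
    for g, y in zip(genotypes, phenotypes):
        groups.setdefault(g, []).append(y)
    rss1 = 0.0                                           # alternative: one mean per genotype
    for ys in groups.values():
        m = sum(ys) / len(ys)
        rss1 += sum((y - m) ** 2 for y in ys)

    if rss0 == 0.0:
        return 0.0          # no phenotypic variation to explain
    if rss1 == 0.0:
        return math.inf     # genotype explains the phenotype exactly
    return (n / 2) * math.log10(rss0 / rss1)

def scan_markers(marker_genotypes, phenotypes):
    """Test the same hypotheses at every marker, in map order."""
    return {name: lod_single_marker(g, phenotypes)
            for name, g in marker_genotypes.items()}

# Hypothetical backcross data: genotype 0 or 1 at each marker, four organisms.
phenotypes = [1.0, 2.0, 4.0, 5.0]
markers = {"marker_A": [0, 1, 0, 1], "marker_B": [0, 0, 1, 1]}
scores = scan_markers(markers, phenotypes)
best = max(scores, key=scores.get)
print(scores, best)
```

In this toy data set the phenotype separates by genotype at the hypothetical marker_B, which therefore receives the larger lod score.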


5.13.3. PEDIGREE DATA USED 
Linkage analysis requires pedigree data for organisms in the population under
study in order to statistically model the segregation of markers. The various forms of
linkage analysis can be categorized by the type of population used to generate the
pedigree data (inbred versus outbred).

Some forms of linkage analysis use pedigree data for populations that originate
from inbred parental lines. The resulting F1 lines will tend to be heterozygous at all
markers and QTL. From the F1 population, crosses are made. Exemplary crosses include
backcrosses, F2 intercrosses, Ft populations (formed by randomly mating F1s for t-1
generations), F2:3 design (F2 individuals are genotyped and then selfed), and Design III (F2
from two inbred lines are backcrossed to both parental lines). Thus, in some
embodiments of the present invention, organisms represent a population, such as an F2
population, and pedigree data for the F2 population is known. This pedigree data is used
to compute logarithm of the odds (lod) scores, as discussed in further detail below.

For many organisms, including humans, manipulatable inbred lines are not
available and outbred populations must be used to perform linkage analysis. Linkage
analysis using outbred populations detects QTLs responsible for within-population
variation whereas linkage analysis using inbred populations detects QTLs responsible for
fixed differences between lines, or even different species. Using within-population
variation (outbred population), as opposed to fixed differences between populations
(inbred population), results in decreased power in QTL detection. With inbred lines, all F1
parents have identical genotypes (including the same linkage phase), so all individuals are
informative, and linkage disequilibrium is maximized. As with inbred lines, a variety of
designs have been proposed for obtaining samples with linkage disequilibrium required
for linkage analysis. Typically, collections of relatives are relied upon.

The major difference between QTL analysis using inbred-line crosses versus
outbred populations is that while the parents in the former are genetically uniform,
parents in the latter are genetically variable. This distinction has several consequences.


First, only a fraction of the parents from an outbred population are informative. For a
parent to provide linkage information, it must be heterozygous at both a marker and a
linked QTL, as only in this situation can a marker-trait association be generated in the
progeny. Only a fraction of random parents from an outbred population are such double
heterozygotes. With inbred lines, F1's are heterozygous at all loci that differ between the
crossed lines, so that all parents are fully informative. Second, there are only two alleles
segregating at any locus in an inbred-line cross design, while outbred populations can be
segregating any number of alleles. Finally, in an outbred population, individuals can
differ in marker-QTL linkage phase, so that an M-bearing gamete might be associated
with QTL allele Q in one parent, and with q in another. Thus, with outbred populations,
marker-trait associations might be examined separately for each parent. With inbred-line
crosses, all F1 parents have identical genotypes (including linkage phase), so one can
average marker-trait associations over all offspring, regardless of their parents. See
Lynch and Walsh, Genetics and Analysis of Quantitative Traits, Sinauer Associates,
Sunderland, Massachusetts.
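
The informativeness condition stated above (heterozygosity at both the marker and the linked QTL) can be sketched as a simple check. The genotype encoding, parent names, and function names are hypothetical; in practice the QTL genotype is not directly observed, so this fragment only illustrates the logical condition.

```python
def is_heterozygous(genotype):
    """A genotype is a pair of alleles; heterozygous means the two differ."""
    first, second = genotype
    return first != second

def is_informative(marker_genotype, qtl_genotype):
    """A parent provides linkage information only if it is heterozygous
    at both the marker and the linked QTL (a double heterozygote)."""
    return is_heterozygous(marker_genotype) and is_heterozygous(qtl_genotype)

# Hypothetical outbred parents: (marker genotype, QTL genotype).
parents = {
    "parent_1": (("M", "m"), ("Q", "q")),  # double heterozygote
    "parent_2": (("M", "m"), ("Q", "Q")),  # homozygous at the QTL
    "parent_3": (("M", "M"), ("Q", "q")),  # homozygous at the marker
    "parent_4": (("M", "m"), ("q", "Q")),  # double heterozygote
}
informative = sorted(name for name, (marker, qtl) in parents.items()
                     if is_informative(marker, qtl))
print(informative)  # ['parent_1', 'parent_4']
```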

5.13.4. MODEL FREE VERSUS MODEL BASED LINKAGE ANALYSIS

Linkage analyses can generally be divided into two classes: model-based linkage
analysis and model-free linkage analysis. Model-based linkage analysis assumes a model
for the mode of inheritance whereas model-free linkage analysis does not assume a mode
of inheritance. Model-free linkage analyses are also known as allele-sharing methods and
non-parametric linkage methods. Model-based linkage analyses are also known as
"maximum likelihood" and "lod score" methods. Either form of linkage analysis can be
used in the present invention.

Model-based linkage analysis is most often used for dichotomous traits and
requires assumptions for the trait model. These assumptions include the disease allele
frequency and penetrance function. For a disease trait, particularly those of interest to
public health, the true underlying model is complex and unknown, so that these
procedures are not applicable. The other form of linkage analysis (model-free linkage
analysis) makes use of allele-sharing. Allele-sharing methods rely on the idea that
relatives with similar phenotypes should have similar genotypes at a marker locus if and
only if the marker is linked to the locus of interest. Linkage analyses are able to localize
the locus of interest to a specific region of a chromosome, but the scope of resolution is

94 



wo 2004/061616 PCT/US2003/041613 

typically limited to no less than 5 cM or roughly 5000 kb. For more information on 
model-based and model-free linkage analysis, see Olson et al^ 1999, Statistics in 
Medicine 18, p. 2961-298 1; Lander and Schork 1994, Science 265, p. 2037; and Elston, 
1998, Genetic Epidemiology 15, p. 565, as well as the sections below. 

5.13.5. KNOWN PROGRAMS FOR PERFORMING LINKAGE ANALYSIS
Many known programs can be used to perform linkage analysis in accordance
with this aspect of the invention. One such program is MapMaker/QTL, which is the
companion program to MapMaker and is the original QTL mapping software.
MapMaker/QTL analyzes F2 or backcross data using standard interval mapping. Another
such program is QTL Cartographer, which performs single-marker regression, interval
mapping (Lander and Botstein, Id.), multiple interval mapping, and composite interval
mapping (Zeng, 1993, PNAS 90: 10972-10976; and Zeng, 1994, Genetics 136:
1457-1468). QTL Cartographer permits analysis of F2 or backcross populations. QTL
Cartographer is available from http://statgen.ncsu.edu/qtlcart/cartographer.html (North
Carolina State University). Another program that can be used by processing step 114 is
Qgene, which performs QTL mapping by either single-marker regression or interval
regression (Martinez and Curnow, 1994, Heredity 73: 198-206). Using Qgene, eleven
different population types (all derived from inbreeding) can be analyzed. Qgene is
available from http://www.qgene.org/. Yet another program is MapQTL, which conducts
standard interval mapping (Lander and Botstein, Id.), multiple QTL mapping (MQM)
(Jansen, 1993, Genetics 135: 205-211; Jansen, 1994, Genetics 138: 871-881), and
nonparametric mapping (Kruskal-Wallis rank sum test). MapQTL can analyze a variety
of pedigree types including outbred pedigrees (cross pollinators). MapQTL is available
from Plant Research International, P.O. Box 16, 6700 AA Wageningen, The Netherlands
(http://www.plant.wageningen-ur.nl/default.asp?section=products). Yet another program
that may be used in some embodiments of processing step 210 is Map Manager QT,
which is a QTL mapping program (Manly and Olson, 1999, Mamm Genome 10:
327-334). Map Manager QT conducts single-marker regression analysis,
regression-based simple interval mapping (Haley and Knott, 1992, Heredity 69,
315-324), composite interval mapping (Zeng, 1993, PNAS 90: 10972-10976), and
permutation tests. A description of Map Manager QT is provided by the reference Manly
and Olson, 1999, Overview of QTL mapping software and introduction to Map Manager
QT, Mammalian Genome 10: 327-334.

Yet another program that may be used to perform linkage analysis is MultiCross
QTL, which maps QTL from crosses originating from inbred lines. MultiCross QTL uses
a linear regression-model approach and handles different methods such as interval
mapping, all-marker mapping, and multiple QTL mapping with cofactors. The program
can handle a wide variety of simple mapping populations for inbred and outbred species.
MultiCross QTL is available from Unité de Biométrie et Intelligence Artificielle, INRA,
31326 Castanet Tolosan, France.

Still another program that can be used to perform linkage analysis is QTL Café.
The program can analyze most populations derived from pure line crosses such as F2
crosses, backcrosses, recombinant inbred lines, and doubled haploid lines. QTL Café
incorporates a Java implementation of Haley and Knott's flanking-marker regression as
well as marker regression, and can handle multiple QTLs. The program allows three
types of QTL analysis: single-marker ANOVA, marker regression (Kearsey and Hyne,
1994, Theor. Appl. Genet. 89: 698-702), and interval mapping by regression (Haley and
Knott, 1992, Heredity 69: 315-324). QTL Café is available from
http://web.bham.ac.uk/g.g.seaton/.

Yet another program that can be used to perform linkage analysis is MAPL, which
performs QTL analysis by either interval mapping (Hayashi and Ukai, 1994, Theor. Appl.
Genet. 87: 1021-1027) or analysis of variance. Different population types, including F2,
backcross, and recombinant inbreds derived from F2 or backcross after a given number of
generations of selfing, can be analyzed. Automatic grouping and ordering of numerous
markers by metric multidimensional scaling is possible. MAPL is available from the
Institute of Statistical Genetics on Internet (ISGI), Yasuo Ukai,
http://web.bham.ac.uk/g.g.seaton/.

Another program that can be used for linkage analysis is R/qtl. This program
provides an interactive environment for mapping QTLs in experimental crosses. R/qtl
makes use of hidden Markov model (HMM) technology for dealing with missing
genotype data. R/qtl has implemented many HMM algorithms, with allowance for the
presence of genotyping errors, for backcrosses, intercrosses, and phase-known four-way
crosses. R/qtl includes facilities for estimating genetic maps, identifying genotyping
errors, and performing single-QTL genome scans and two-QTL, two-dimensional
genome scans by interval mapping, Haley-Knott regression, and multiple
imputation. R/qtl is available from Karl W. Broman, Johns Hopkins University,
http://biosun01.biostat.jhsph.edu/~kbroman/qtl/.

Those of skill in the art will appreciate that there are several other programs and
algorithms that can be used in the steps of the methods of the present invention where
quantitative genetic analysis is needed, and all such programs and algorithms are within
the scope of the present invention.



5.13.6. MODEL-BASED PARAMETRIC LINKAGE ANALYSIS
In model-based linkage analysis (also termed "LOD score" methods or
parametric methods), the details of a trait's mode of inheritance are modeled.
Typically, particular values of the allele frequencies and the penetrance function are
specified.



5.13.6.1. INTERVAL MAPPING VIA MAXIMUM LIKELIHOOD / INBRED POPULATION

In one embodiment of the present invention, linkage analysis comprises QTL
interval mapping in accordance with algorithms derived from those first proposed by
Lander and Botstein, 1989, "Mapping Mendelian factors underlying quantitative traits
using RFLP linkage maps," Genetics 121: 185-199. The principle behind interval
mapping is to test a model for the presence of a QTL at many positions between two
mapped marker loci. The model is fit, and its goodness is tested using a technique such
as the maximum likelihood method. Maximum likelihood theory assumes that when a
QTL is located between two biallelic markers, the marker genotypes (i.e., AABB, AAbb,
aaBB, aabb for doubled haploid progeny) each contain mixtures of quantitative trait locus
(QTL) genotypes. Maximum likelihood involves searching for QTL parameters that give
the best approximation for quantitative trait distributions that are observed for each
marker class. Models are evaluated by computing the likelihood of the observed
distributions with and without fitting a QTL effect.
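
As a minimal sketch of the likelihood comparison described above, the following Python fragment computes a LOD score at a single marker position, where the QTL genotype is assumed to coincide with the marker genotype so that no mixture has to be fitted. It assumes two genotype classes (as in a backcross or doubled haploid design) and normally distributed trait values with a common variance. The function names are illustrative and are not taken from any of the programs named in Section 5.13.5.

```python
import math

def rss(values, center):
    """Residual sum of squares of values about a given center."""
    return sum((v - center) ** 2 for v in values)

def lod_at_marker(trait_by_class):
    """LOD score comparing a model with a QTL effect to a model without one.

    trait_by_class maps each marker genotype class to a list of trait values.
    Under a normal model with a common variance fitted by maximum likelihood,
    the log10 likelihood ratio reduces to (n / 2) * log10(RSS_null / RSS_qtl).
    The sketch assumes the within-class RSS is greater than zero.
    """
    all_values = [v for vals in trait_by_class.values() for v in vals]
    n = len(all_values)
    rss_null = rss(all_values, sum(all_values) / n)
    rss_qtl = sum(rss(vals, sum(vals) / len(vals))
                  for vals in trait_by_class.values())
    return (n / 2.0) * math.log10(rss_null / rss_qtl)

# Hypothetical trait values for two doubled haploid marker classes.
example = {"AA": [1.0, 2.0, 3.0], "aa": [4.0, 5.0, 6.0]}
lod = lod_at_marker(example)
```

Between markers the QTL genotype is not observed, so a full interval-mapping implementation would replace the two class means with a mixture likelihood at each test position.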

In some embodiments of the present invention, linkage analysis is performed
using the algorithm of Lander, as implemented in programs such as GeneHunter. See, for
example, Kruglyak et al., 1996, Parametric and Nonparametric Linkage Analysis: A
Unified Multipoint Approach, American Journal of Human Genetics 58: 1347-1363;
Kruglyak and Lander, 1998, Journal of Computational Biology 5: 1-7; Kruglyak, 1996,
American Journal of Human Genetics 58, 1347-1363. In such embodiments, an unlimited
number of markers may be used, but pedigree size is constrained due to computational
limitations. In other embodiments, the MENDEL software package is used. (See
http://bimas.dcrt.nih.gov/linkage/ltools.html). In such embodiments, the size of the
pedigree can be unlimited, but the number of markers that can be used is constrained due
to computational limitations. The techniques described in this Section typically require
an inbred population.

5.13.6.2. INTERVAL MAPPING USING LINEAR REGRESSION / INBRED POPULATION

In some embodiments of the present invention, interval mapping is based on
regression methodology and gives estimates of QTL position and effect that are similar to
those given by the maximum likelihood method. Since the QTL genotypes are unknown
in mapping based on regression methodology, genotypes are replaced by probabilities
estimated using genotypes at the nearest flanking markers or all linked markers. See, e.g.,
Haley and Knott, 1992, Heredity 69, 315-324; and Jiang and Zeng, 1997, Genetica
101: 47-58. The techniques described in this Section typically require an inbred
population.
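
The regression approach can be illustrated with a short sketch for a single marker interval. It assumes a backcross or doubled haploid population with marker genotypes coded 0 or 1, no crossover interference, and the Haldane map function; these choices, and all function names, are assumptions made for the example rather than details taken from the cited references. At each test position the unknown QTL genotype is replaced by its probability given the two flanking markers, the trait is regressed on that probability, and the position with the smallest residual sum of squares is taken as the estimate.

```python
import math

def haldane(d_cm):
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def qtl_prob(left, right, r1, r2):
    """P(QTL carries the '1' allele | flanking marker genotypes).

    r1: recombination fraction between the left marker and the QTL.
    r2: recombination fraction between the QTL and the right marker.
    """
    r = r1 + r2 - 2.0 * r1 * r2  # marker-to-marker fraction, no interference
    if left == 1 and right == 1:
        return (1 - r1) * (1 - r2) / (1 - r)
    if left == 1 and right == 0:
        return (1 - r1) * r2 / r
    if left == 0 and right == 1:
        return r1 * (1 - r2) / r
    return r1 * r2 / (1 - r)

def regression_rss(x, y):
    """Residual sum of squares from a least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = my - slope * mx
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))

def scan_interval(left, right, trait, interval_cm, step_cm=1.0):
    """Return (position in cM, RSS) for test positions across the interval."""
    results = []
    for k in range(int(interval_cm / step_cm) + 1):
        pos = k * step_cm
        r1, r2 = haldane(pos), haldane(interval_cm - pos)
        x = [qtl_prob(lg, rg, r1, r2) for lg, rg in zip(left, right)]
        results.append((pos, regression_rss(x, trait)))
    return results

# Hypothetical data: the trait tracks the left marker exactly, so the scan
# should place the QTL at the left end of a 20 cM interval.
left = [1, 1, 0, 0, 1, 0, 1, 0]
right = [1, 0, 1, 0, 1, 0, 0, 1]
trait = [float(g) for g in left]
best_pos, best_rss = min(scan_interval(left, right, trait, 20.0), key=lambda t: t[1])
```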

5.13.7. MODEL-FREE NONPARAMETRIC LINKAGE ANALYSIS

Model-based linkage analysis (classical linkage analysis) calculates a lod score
that represents the chance that a given locus in the genome is genetically linked to a trait,
assuming a specific mode of inheritance for the trait. Namely, the allele frequencies and
penetrance values are included as parameters and are subsequently estimated. In the case
of complex diseases, it is often difficult to model with any certainty all the causes of
familial aggregation. In other words, when the trait exhibits non-Mendelian segregation it
can be difficult to obtain reliable estimates of penetrance values, including phenocopy
risks, and the allele frequency of the disease mutation. Indeed, it can be the case that
different mutations at different loci have different kinds of effect on susceptibility, some
major and some minor, some dominant and some recessive. If different modes of
transmission are operative in different families, or if different loci interact in the same
family, then no one transmission model may be appropriate. If the transmission model
for a linkage analysis is specified incorrectly, the results produced from it may be
neither valid nor interpretable.

As a result of the difficulties described above, a variety of methods have been
developed to test for linkage without the need to specify values for the parameters
defining the transmission model, and these methods are termed model-free linkage
analyses (meaning that they can be applied without regard to the true transmission
model). Such methods are based on the premise that relatives who are similar with
respect to the phenotype of interest will be similar at a marker locus, sharing identical
marker alleles, only if a locus underlying the phenotype is linked to the marker.

Model-free linkage analyses (allele-sharing methods) are not based on
constructing a model, but rather on rejecting a model. Specifically, one tries to prove that
the inheritance pattern of a chromosomal region is not consistent with random Mendelian
segregation by showing that affected relatives inherit identical copies of the region more
often than expected by chance. Affected relatives should show excess allele sharing in
regions linked to the QTL even in the presence of incomplete penetrance, phenocopy,
genetic heterogeneity, and high-frequency disease alleles.

5.13.7.1. IDENTICAL BY DESCENT - AFFECTED PEDIGREE MEMBER (IBD-APM) ANALYSIS / OUTBRED POPULATION

In one embodiment, nonparametric linkage analysis involves studying affected
relatives 46 (Fig. 1) in a pedigree 310 to see how often a particular copy of a
chromosomal region is shared identical-by-descent (IBD), that is, is inherited from a
common ancestor within the pedigree. The frequency of IBD sharing at a locus can then
be compared with random expectation. An identity-by-descent affected-pedigree-member
(IBD-APM) statistic can be defined as:

$$T(s) = \sum_{i,j} X_{ij}(s)$$

where $X_{ij}(s)$ is the number of copies shared IBD at position $s$ along a chromosome, and
where the sum is taken over all distinct pairs $(i,j)$ of affected relatives 46 in a pedigree
310. The results from multiple families can be combined in a weighted sum $T(s)$.
Assuming random segregation, $T(s)$ tends to a normal distribution with a mean $\mu$ and a
variance $\sigma^2$ that can be calculated on the basis of the kinship coefficients of the relatives
compared. See, for example, Blackwelder and Elston, 1985, Genet. Epidemiol. 2, p. 85;
Whittemore and Halpern, 1994, Biometrics 50, p. 118; Weeks and Lange, 1988, Am. J.
Hum. Genet. 42, p. 315; and Elston, 1998, Genetic Epidemiology 15, p. 565. Deviation
from random segregation is detected when the statistic $(T - \mu)/\sigma$ exceeds a critical
threshold. The techniques in this section typically use an outbred population.
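
The statistic can be sketched in a few lines of Python. The pedigree members, IBD counts, and function names below are hypothetical. The null mean and standard deviation are passed in by the caller, because in general they depend on the kinship coefficients of the relatives compared. The only case worked out here is the simplest one, independent affected sib pairs, for which each pair shares 0, 1, or 2 copies with probabilities 1/4, 1/2, and 1/4, giving a mean of 1 and a variance of 1/2 per pair.

```python
import math
from itertools import combinations

def apm_statistic(affected, ibd_sharing):
    """T(s): sum of IBD counts over all distinct pairs of affected relatives.

    ibd_sharing maps frozenset({i, j}) to the number of copies (0, 1, or 2)
    that relatives i and j share IBD at the position of interest.
    """
    return sum(ibd_sharing[frozenset(pair)] for pair in combinations(affected, 2))

def standardized(t, mu, sigma):
    """(T - mu) / sigma, to be compared against a critical threshold."""
    return (t - mu) / sigma

# One hypothetical pedigree with three affected members.
affected = ["sib1", "sib2", "cousin"]
ibd = {
    frozenset(("sib1", "sib2")): 2,
    frozenset(("sib1", "cousin")): 1,
    frozenset(("sib2", "cousin")): 0,
}
t_family = apm_statistic(affected, ibd)

# Forty independent affected sib pairs sharing 52 copies IBD in total.
n_pairs, t_total = 40, 52
z = standardized(t_total, mu=n_pairs * 1.0, sigma=math.sqrt(n_pairs * 0.5))
```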


5.13.7.2. AFFECTED SIB PAIR ANALYSIS / OUTBRED POPULATION 

Affected sib pair analysis is one form of IBD-APM analysis (Section 5.13.7.1).
For example, two sibs can show IBD sharing for zero, one, or two copies of any locus
(with a 25%-50%-25% distribution expected under random segregation). If both parents
are available, the data can be partitioned into separate IBD sharing for the maternal and
paternal chromosome (zero or one copy, with a 50%-50% distribution expected under
random segregation). In either case, excess allele sharing can be measured with a χ² test.
In the ASP approach, a large number of small pedigrees (affected siblings and their
parents) are used. DNA samples are collected from each organism and genotyped using a
large collection of markers (e.g., microsatellites, SNPs). Then a check for functional
polymorphism is performed. See, for example, Suarez et al., 1978, Ann. Hum. Genet. 42,
p. 87; Weitkamp, 1981, N. Engl. J. Med. 305, p. 1301; Knapp et al., 1994, Hum. Hered.
44, p. 37; Holmans, 1993, Am. J. Hum. Genet. 52, p. 362; Rich et al., 1991,
Diabetologia 34, p. 350; Owerbach and Gabbay, 1994, Am. J. Hum. Genet. 54, p. 909;
and Berrettini et al., 1994, Proc. Natl. Acad. Sci. USA 91, p. 5918. For more information on
sib pair analysis, see Hamer et al., 1993, Science 261, p. 321.
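
One simple way to measure departure from the 25%-50%-25% expectation is a chi-square goodness-of-fit test on the counts of sib pairs sharing zero, one, or two copies IBD. The sketch below is an illustration under that assumption, not the procedure of any particular reference cited above. The counts are hypothetical, and the P value uses the closed form of the chi-square survival function for two degrees of freedom.

```python
import math

def asp_chi2(n0, n1, n2):
    """Goodness-of-fit test of IBD sharing counts against 1/4 : 1/2 : 1/4.

    n0, n1, n2: numbers of affected sib pairs sharing 0, 1, and 2 copies IBD.
    Returns (chi-square statistic, P value on 2 degrees of freedom).
    """
    n = n0 + n1 + n2
    expected = (0.25 * n, 0.5 * n, 0.25 * n)
    chi2 = sum((o - e) ** 2 / e for o, e in zip((n0, n1, n2), expected))
    p_value = math.exp(-chi2 / 2.0)  # exact survival function for 2 df
    return chi2, p_value

# Hypothetical sample of 100 affected sib pairs with excess sharing.
chi2, p_value = asp_chi2(10, 50, 40)
```

Because this test is two-sided in effect, a large statistic should be accompanied by a check that the departure is in the direction of excess sharing.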

In some embodiments, ASP statistics that test whether affected sibling pairs have
a mean proportion of marker genes identical-by-descent that is > 0.50 are computed.
See, for example, Blackwelder and Elston, 1985, Genet. Epidemiol. 2, p. 85. In some
embodiments, such statistics are computed using the SIBPAL program of the SAGE
package. See, for example, Tran et al., 1991, (SIB-PAL) Sib-pair linkage program
(Elston, New Orleans), Version 2.5. These statistics are computed on all possible
affected pairs. In some embodiments the number of degrees of freedom of the t test is set
at the number of independent affected pairs (defined per sibship as the number of affected
individuals minus 1) in the sample instead of the number of all possible pairs. See, for
example, Suarez and Eerdewegh, 1984, Am. J. Med. Genet. 18, p. 135. The techniques in
this section typically use an outbred population.
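
A minimal sketch of such a mean test follows. It is not the SIBPAL implementation; the function names and the sample values are hypothetical. The first function computes a one-sample t statistic for the hypothesis that the mean proportion of alleles shared IBD exceeds 0.5, and the second computes the degrees of freedom as the number of independent affected pairs, that is, the number of affected individuals minus one in each sibship.

```python
import math
import statistics

def asp_mean_t(proportions_ibd):
    """t statistic for H1: mean proportion of alleles shared IBD > 0.5.

    proportions_ibd holds one value per affected pair: 0.0, 0.5, or 1.0 when
    IBD status is known exactly, or an estimated proportion otherwise.
    """
    n = len(proportions_ibd)
    mean = statistics.mean(proportions_ibd)
    std_err = statistics.stdev(proportions_ibd) / math.sqrt(n)
    return (mean - 0.5) / std_err

def independent_pairs(affected_per_sibship):
    """Degrees of freedom: sum over sibships of (affected individuals - 1)."""
    return sum(k - 1 for k in affected_per_sibship)

# Hypothetical proportions for eight affected pairs.
t_stat = asp_mean_t([1.0, 0.5, 0.5, 1.0, 0.5, 1.0, 0.0, 0.5])
df = independent_pairs([2, 3, 4])
```

The t statistic would then be referred to a one-sided t distribution with the chosen degrees of freedom.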


5.13.7.3. IDENTICAL BY STATE - AFFECTED PEDIGREE MEMBER (IBS-APM) ANALYSIS / OUTBRED POPULATION

In some instances, it is not possible to tell whether two relatives inherited a
chromosomal region IBD, but only whether they have the same alleles at genetic markers
in the region, that is, are identical by state (IBS). IBD can be inferred from IBS when a
dense collection of highly polymorphic markers has been examined, but the early stages
of genetic analysis can involve sparser maps with less informative markers so that IBD
status cannot be determined exactly. Various methods are available to handle situations
in which IBD cannot be inferred from IBS. One method infers IBD sharing on the basis
of the marker data (expected identity by descent affected-pedigree-member; IBD-APM).
See, for example, Suarez et al., 1978, Ann. Hum. Genet. 42, p. 87; and Amos et al., 1990,
Am. J. Hum. Genet. 47, p. 842. Another method uses a statistic that is based explicitly on
IBS sharing (an IBS-APM method). See, for example, Weeks and Lange, 1988, Am. J.
Hum. Genet. 42, p. 315; Lange, 1986, Am. J. Hum. Genet. 39, p. 148; Jeunemaitre et al.,
1992, Cell 71, p. 169; and Pericak-Vance et al., 1991, Am. J. Hum. Genet. 48, p. 1034.

In one embodiment, the IBS-APM techniques of Weeks and Lange, 1988, Am. J.
Hum. Genet. 42, p. 315; and Weeks and Lange, 1992, Am. J. Hum. Genet. 50, p. 859 are
used. Such techniques use marker information of affected individuals to test whether the
affected persons within a pedigree are more similar to each other at the marker locus than
would be expected by chance. In some embodiments, the marker similarity is measured
in terms of identity by state. In some embodiments, the APM method uses a marker allele
frequency weighting function, $f(p)$, where $p$ is the allele frequency, and the APM test
statistics are presented separately for each of three different weighting functions, $f(p) = 1$,
$f(p) = 1/\sqrt{p}$, and $f(p) = 1/p$. Whereas the second and third functions render the sharing
of a rare allele among affected persons a more significant event, the first weighting
function uses the allele frequencies only in calculation of the expected degree of marker
allele sharing. The third function, $f(p) = 1/p$, can lead (more frequently than the first two)
to a non-normal distribution of the test statistic. The second function is a reasonable
compromise for generating a normal distribution of the test statistic while incorporating
an allele frequency function. In some instances, the APM test statistics are sensitive to
marker locus and allele frequency misspecification. See, for example, Babron et al.,
1993, Genet. Epidemiol. 10, p. 389. In some embodiments, allele frequencies are
estimated from the pedigree data using the method of Boehnke, 1991, Am. J. Hum. Genet.
48, p. 22, or by studying alleles. See, also, for example, Berrettini et al., 1994, Proc. Natl.
Acad. Sci. USA 91, p. 5918.

In some embodiments, the significance of the APM test statistics is calculated
from the theoretical (normal) distribution of the statistic. In addition, numerous replicates
(e.g., 10,000) of these data, assuming independent inheritance of marker alleles and
disease (i.e., no linkage), are simulated to assess the probability of observing the actual
results (or a more extreme statistic) by chance. This probability is the empirical P value.
Each replicate is generated by simulating an unlinked marker segregating through the
actual pedigrees. An APM statistic is generated by analyzing the simulated data set
exactly as the actual data set is analyzed. The rank of the observed statistic in the
distribution of the simulated statistics determines the empirical P value. The techniques
in this section typically use an outbred population.



5.13.7.4. QUANTITATIVE TRAITS 

Model-free linkage analysis can also be applied to quantitative traits. An
approach proposed by Haseman and Elston, 1972, Behav. Genet. 2, p. 3, is based on the
notion that the phenotypic similarity between two relatives should be correlated with the
number of alleles shared at a trait-causing locus. Formally, one performs regression
analysis of the squared difference Δ in a trait between two relatives on the number x of
alleles shared IBD at a locus. The approach can be suitably generalized to other relatives
(Blackwelder and Elston, 1982, Commun. Stat. Theor. Methods 11, p. 449) and
multivariate phenotypes (Amos et al., 1986, Genet. Epidemiol. 3, p. 255). See also
Marsh et al., 1994, Science 264, p. 1152; Morrison et al., 1994, Nature 367, p. 284;
Amos, 1994, Am. J. Hum. Genet. 54, p. 535; and Elston, Am. J. Hum. Genet. 63, p. 931.
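
A minimal sketch of this regression is given below, using hypothetical trait values and IBD counts. It fits an ordinary least-squares line of the squared trait difference on the number of alleles shared IBD; under linkage, pairs sharing more alleles are expected to be more alike, so the slope is expected to be negative. The function name is illustrative.

```python
def haseman_elston(ibd_counts, trait_pairs):
    """Regress squared trait differences on IBD sharing.

    ibd_counts: number of alleles (0, 1, or 2) shared IBD by each pair.
    trait_pairs: (trait of relative 1, trait of relative 2) for each pair.
    Returns (slope, intercept) of the least-squares line.
    """
    sq_diff = [(y1 - y2) ** 2 for y1, y2 in trait_pairs]
    n = len(ibd_counts)
    mean_x = sum(ibd_counts) / n
    mean_y = sum(sq_diff) / n
    sxx = sum((x - mean_x) ** 2 for x in ibd_counts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ibd_counts, sq_diff))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical sib pairs: trait differences shrink as IBD sharing increases.
ibd = [0, 1, 2, 0, 1, 2]
traits = [(0.0, 3.0), (0.0, 2.0), (0.0, 1.0), (1.0, 4.0), (1.0, 3.0), (1.0, 2.0)]
slope, intercept = haseman_elston(ibd, traits)
```

A one-sided test of whether the slope is less than zero would then serve as the test for linkage.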


5.14. ASSOCIATION ANALYSIS 
This section describes a number of association tests that can be used in the present
invention. Association studies can be done with samples of pedigrees or samples of
unrelated individuals. Further, association studies can be done for a dichotomous trait
(e.g., disease) or a quantitative trait. See, for example, Nepom and Ehrlich, 1991, Annu.
Rev. Immunol. 9, p. 493; Strittmatter and Roses, 1996, Annu. Rev. Neurosci. 19, p. 53;
Voorberg et al., 1994, Lancet 343, p. 1535; Zoller et al., Lancet 343, p. 1536; Bennet et al.,
1995, Nature Genet. 9, p. 284; Grant et al., 1996, Nature Genet. 14, p. 205; and Smith et
al., 1997, Science 277, p. 959. As such, association studies test whether a disease and an
allele show correlated occurrence across the population, whereas linkage studies
determine whether there is correlated transmission within pedigrees.

Whereas linkage analysis involves the pattern of transmission of gametes from
one generation to the next, association is a property of the population of gametes.
Association exists between alleles at two loci if the frequency with which they occur
within the same gamete is different from the product of the allele frequencies. If this
association occurs between two linked loci, then utilizing the association will allow for
fine localization, since the strength of association is in large part due to historical
recombinations rather than recombination within a few generations of a family. In the
simplest scenario, association arises when a mutation that causes disease occurs at a
locus at some time, $t_0$. At that time, the disease mutation occurs on a specific genetic
background composed of the alleles at all other loci; thus, the disease mutation is
completely associated with the alleles of this background. As time progresses,
recombination occurs between the disease locus and all other loci, causing the association
to diminish. Loci that are closer to the disease locus will generally have higher levels of
association, with association rapidly dropping off for markers further away. The reliance
of association on evolutionary history can provide localization to a region as small as
50-75 kb. Association is also called linkage disequilibrium. Association (linkage
disequilibrium) can exist between alleles at two loci without the loci being linked.
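
The definition of association given above can be made concrete with a short sketch. For two biallelic loci with alleles A/a and B/b, the disequilibrium coefficient D is the difference between the observed frequency of the AB gamete and the product of the allele frequencies of A and B; the squared correlation r² is a commonly used normalization. The gamete counts and the function name are hypothetical.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r_squared) for two biallelic loci from gamete counts.

    D = p_AB - p_A * p_B, which is zero when the alleles are not associated.
    The sketch assumes both loci are polymorphic in the sample.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n
    p_B = (n_AB + n_aB) / n
    d = p_AB - p_A * p_B
    r_squared = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d, r_squared

# Hypothetical sample of 100 gametes.
d, r_squared = linkage_disequilibrium(40, 10, 10, 40)
```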

Two forms of association analysis are discussed in the sections below: population-based
association analysis and family-based association analysis. More generally, those
of skill in the art will appreciate that there are several different forms of association
analysis, and all such forms of association analysis can be used in steps of the present
invention that require the use of quantitative genetic analysis.

In some embodiments, whole-genome association studies are performed in
accordance with the present invention. Two methods can be used to perform
whole-genome association studies, the "direct-study" approach and the "indirect-study"
approach. In the direct-study approach, all common functional variants of a given gene
are catalogued and tested directly to determine whether there is an increased prevalence
(association) of a particular functional variant in affected individuals within the coding
region of the given gene. The "indirect-study" approach uses a very dense marker map
that is arrayed across both coding and noncoding regions. A dense panel of
polymorphisms (e.g., SNPs) from such a map can be tested in controls to identify
associations that narrowly locate the neighborhood of a susceptibility or resistance gene.
This strategy is based on the hypothesis that each sequence variant that causes disease
must have arisen in a particular individual at some time in the past, so the specific alleles
for polymorphisms (haplotype) in the neighborhood of the altered gene in that individual
can be inherited in all of his or her descendants. The presence of a recognizable ancestral
haplotype therefore becomes an indicator of the disease-associated polymorphism. In
actuality, some of the alleles will be in association while others will not, due to
recombination occurring between the mutation and other polymorphisms.

5.14.1. POPULATION-BASED (MODEL-FREE) ASSOCIATION ANALYSIS
In population-based (model-free) association studies, allele frequencies in
afflicted organisms are contrasted with allele frequencies in control organisms in order to
determine if there is an association between a particular allele and a complex trait.
Population-based association studies for dichotomous traits are also referred to as
case-control studies. A case-control study is based on the comparison of unrelated affected
and unaffected individuals from a population. An allele A at a gene of interest is said to
be associated with the phenotype if it occurs at significantly higher frequency among
affected compared with control individuals. Statistical significance can be tested by a
number of methods, including, but not limited to, logistic regression. Association studies
are discussed in Lander, 1996, Science 274, 536; Lander and Schork, 1994, Science 265,
2037; Risch and Merikangas, 1996, Science 273, 1516; and Collins et al., 1997, Science
278, 1533.
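
As one illustration of such a comparison, the sketch below applies a Pearson chi-square test to a 2 x 2 table of allele counts in cases and controls. This is a simpler stand-in for the logistic regression mentioned above, not an implementation of it, and it assumes unrelated individuals and no confounding. The counts and the function name are hypothetical; the P value uses the identity that the chi-square survival function for one degree of freedom equals erfc(sqrt(x / 2)).

```python
import math

def allele_chi2(case_a, case_other, ctrl_a, ctrl_other):
    """Pearson chi-square test on a 2 x 2 table of allele counts.

    case_a, case_other: counts of allele A and of all other alleles in cases.
    ctrl_a, ctrl_other: the same counts in controls.
    Returns (chi-square statistic, P value on 1 degree of freedom).
    """
    n = case_a + case_other + ctrl_a + ctrl_other
    numerator = n * (case_a * ctrl_other - case_other * ctrl_a) ** 2
    denominator = ((case_a + case_other) * (ctrl_a + ctrl_other)
                   * (case_a + ctrl_a) * (case_other + ctrl_other))
    chi2 = numerator / denominator
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical counts: 100 case alleles and 100 control alleles.
chi2_cc, p_cc = allele_chi2(60, 40, 40, 60)
```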

As is true for case-control studies generally, confounding is a problem for
inferring a causal relationship between a disease and a measured risk factor using
population-based association analysis. One approach to deal with confounding is the
matched case-control design, where individual controls are matched to cases on potential
confounding factors (for example, age and sex) and the matched pairs are then examined
individually for the risk factor to see if it occurs more frequently in the case than in its
matched control. In some embodiments, cases and controls are ethnically comparable. In
other words, homogeneous and randomly mating populations are used in the association
analysis. In some embodiments, the family-based association studies described below are
used to minimize the effects of confounding due to genetically heterogeneous
populations. See, for example, Risch, 2000, Nature 405, p. 847.

5.14.2. FAMILY-BASED ASSOCIATION ANALYSIS




Family-based association analysis is used in some embodiments of the invention.
In some embodiments, each affected organism is matched with one or more unaffected
siblings (see, for example, Curtis, 1997, Ann. Hum. Genet. 61, p. 319) or cousins (see, for
example, Witte et al., 1999, Am. J. Epidemiol. 149, p. 693) and analytical techniques for
matched case-control studies are used to estimate effects and to test hypotheses. See, for
example, Breslow and Day, 1989, Statistical methods in cancer research I, The analysis of
case-control studies 32, Lyon: IARC Scientific Publications. The following subsections
describe some forms of family-based association studies. Those of skill in the art will
recognize that there are numerous forms of family-based association studies and all such
methodologies can be used in the present invention.

5.14.2.1. HAPLOTYPE RELATIVE RISK TEST 

In some embodiments, the haplotype relative risk test is used. In the haplotype
relative risk method, all marker alleles compared arise from the same person. The marker
alleles that parents transmit to an affected offspring (case alleles) are compared with those
that they do not transmit to such an offspring (control alleles). One can also compare
transmitted and nontransmitted genotypes. Consider the 2n parents of n affected persons.
This population can be classified into a fourfold table according to whether the
transmitted allele is a marker allele (M) or some other allele (M̄) and according to whether
the nontransmitted allele is similarly M or M̄:



                      Nontransmitted allele

Transmitted allele    M        M̄        Total

M                     a        b        a+b
M̄                     c        d        c+d
Total                 a+c      b+d      2n = a+b+c+d




To test for association, a determination is made as to whether the proportion of M
alleles that are transmitted, (a+b)/2n, differs significantly from the proportion of M alleles
that are nontransmitted, (a+c)/2n. One appropriate statistical test for this determination is
comparison of (b-c)²/(b+c) to a chi-square distribution with one degree of freedom when
the sample is large.
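
The comparison can be sketched as follows, reading the statistic as McNemar's test for paired proportions, (b − c)²/(b + c), which uses only the parents whose transmitted and nontransmitted alleles differ. The parent data, the label "other" for any non-M allele, and the function names are hypothetical. The P value again uses erfc(sqrt(x / 2)) for one degree of freedom.

```python
import math
from collections import Counter

def fourfold_counts(parents):
    """Tally (a, b, c, d) from (transmitted, nontransmitted) allele pairs.

    Each allele is labeled 'M' for the marker allele or 'other' otherwise.
    """
    tally = Counter(parents)
    return (tally[("M", "M")], tally[("M", "other")],
            tally[("other", "M")], tally[("other", "other")])

def mcnemar(b, c):
    """McNemar statistic (b - c)^2 / (b + c) and its 1-df chi-square P value."""
    chi2 = (b - c) ** 2 / (b + c)
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical data for the 2n = 100 parents of n = 50 affected offspring.
parents = ([("M", "M")] * 10 + [("M", "other")] * 30
           + [("other", "M")] * 15 + [("other", "other")] * 45)
a, b, c, d = fourfold_counts(parents)
chi2_hrr, p_hrr = mcnemar(b, c)
```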

The row totals for the table above are the numbers of transmitted alleles that are M
and M̄, while the column totals are the numbers of nontransmitted alleles that are M and
M̄. These four totals can be put into a fourfold table that classifies the 4n parental
alleles, rather than the 2n parents:



Marker allele    Transmitted    Non-transmitted    Total

M                a+b            a+c                2a+b+c
M̄                c+d            b+d                b+c+2d
Total            2n             2n                 4n



The haplotype relative risk ratio is defined as (a+b)(b+d)/(a+c)(c+d), the
cross-product ratio of the table above. A chi-square distribution using one degree of
freedom can be used to determine whether the haplotype relative risk ratio differs
significantly from one. See, for example, Rudorfer et al., 1984, Br. J. Clin. Pharmacol.
17, 433; Mueller and Young, 1997, Emery's Elements of Medical Genetics, Kalow ed.,
p. 169-175, Churchill Livingstone, Edinburgh; Roses, 2000, Nature 405, p. 857; and
Elston, 1998, Genetic Epidemiology 15, p. 565.



5.14.2.2. TRANSMISSION DISEQUILIBRIUM TEST
In some embodiments, the transmission disequilibrium test (TDT) is used. TDT
considers parents who are heterozygous for an allele and evaluates the frequency with
which that allele is transmitted to affected offspring. By restriction to heterozygous
parents, the TDT differs from other model-free tests for association between specific
alleles of a polymorphic marker and a disease locus. The parameters of that locus,
genotypes of sampled individuals, linkage phase, and recombination frequency are not
specified. Nevertheless, by considering only heterozygous parents, the TDT is specific
for association between linked loci.




TDT is a test of linkage and association that is valid in heterogeneous populations.
It was originally proposed for data consisting of families ascertained due to the presence
of a diseased child. The genetic data consist of the marker genotypes for the parents and
child. The TDT is based on transmissions, to the diseased child, from heterozygous
parents, that is, parents whose genotypes consist of different alleles. In particular, consider a
biallelic marker with alleles M1 and M2. The TDT counts the number of times, $n_{12}$, that
M1M2 parents transmit marker allele M1 to the diseased child and the number of times,
$n_{21}$, that M2 is transmitted. If the marker is not linked to the disease locus, i.e., $\theta = 0.5$, or
if there is no association between M1 and the disease mutation, then conditional on the
number of heterozygous parents, and in the absence of segregation distortion, $n_{12}$ is
distributed binomially, $B(n_{12} + n_{21}, 0.5)$. The null hypothesis of no linkage or no
association can be tested with the statistic

$$\chi^2_{TDT} = \frac{(n_{12} - n_{21})^2}{n_{12} + n_{21}}$$

with statistical significance level approximated using the $\chi^2$ distribution with one df or
computed exactly with the binomial distribution. When transmissions from more than
one diseased child per family are included in the TDT statistic, the test is valid only as a
test of linkage.
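
Both routes to a significance level can be sketched briefly. The code assumes the standard form of the TDT statistic, the squared difference of the two transmission counts divided by their sum, and a two-sided exact test that doubles the upper binomial tail. The counts and function names are hypothetical.

```python
import math

def tdt_chi2(n12, n21):
    """TDT statistic: (n12 - n21)^2 / (n12 + n21), referred to chi-square, 1 df."""
    return (n12 - n21) ** 2 / (n12 + n21)

def tdt_exact_p(n12, n21):
    """Two-sided exact P value under the binomial null B(n12 + n21, 0.5)."""
    n = n12 + n21
    k = max(n12, n21)
    upper_tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * upper_tail)

# Hypothetical counts: heterozygous parents transmit M1 30 times and M2 15 times.
chi2_tdt = tdt_chi2(30, 15)
p_asymptotic = math.erfc(math.sqrt(chi2_tdt / 2.0))  # chi-square, 1 df
p_exact = tdt_exact_p(30, 15)
```

The exact test is the safer choice when the number of informative transmissions is small.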


Several extensions of the TDT test have been proposed and all such extensions are
within the scope of the present invention. See, for example, Morton and Collins, 1998,
Proc. Natl. Acad. Sci. USA 95, p. 11389; Terwilliger, 1995, Am. J. Hum. Genet. 56, p. 777.
See also, for example, Mueller and Young, 1997, Emery's Elements of Medical Genetics,
Kalow ed., p. 169-175, Churchill Livingstone, Edinburgh; Zhao et al., 1998, Am. J. Hum.
Genet. 63, p. 225; Roses, 2000, Nature 405, p. 857; Spielman et al., 1993, Am. J. Hum.
Genet. 52, p. 506; and Ewens and Spielman, Am. J. Hum. Genet. 57, p. 455.

5.14.2.3. SIBSHIP-BASED TEST

In some embodiments, the sibship-based test is used. See, for example, Wiley,
1998, Cur. Pharmaceut. Des. 4, p. 417; Blackstock and Weir, 1999, Trends Biotechnol.
17, p. 121; Kozian and Kirschbaum, 1999, Trends Biotechnol. 17, p. 73; Rockett et al.,
Xenobiotica 29, p. 655; Roses, 1994, J. Neuropathol. Exp. Neurol. 53, p. 429; and Roses,
2000, Nature 405, p. 857.




5.15. COMPLEX TRAITS 

In some embodiments of the present invention, the term "complex trait" refers to
any clinical trait T that does not exhibit classic Mendelian inheritance. In some
embodiments, the term "complex trait" refers to a trait that is affected by two or more
gene loci. In some embodiments, the term "complex trait" refers to a trait that is affected
by two or more gene loci in addition to one or more factors including, but not limited to,
age, sex, habits, and environment. See, for example, Lander and Schork, 1994, Science
265: 2037. Such "complex" traits include, but are not limited to, susceptibilities to heart
disease, hypertension, diabetes, obesity, cancer, and infection. Complex traits arise when
the simple correspondence between genotype and phenotype breaks down, either because
the same genotype can result in different phenotypes (due to the effect of chance,
environment, or interaction with other genes) or different genotypes can result in the same
phenotype.

In some embodiments, a complex trait is one in which there exists no genetic
marker that shows perfect cosegregation with the trait due to incomplete penetrance,
phenocopy, and/or nongenetic factors (e.g., age, sex, environment, and the effect of other
genes). Incomplete penetrance means that some individuals who inherit a predisposing
allele may not manifest the disease. Phenocopy means that some individuals who inherit
no predisposing allele may nonetheless get the disease as a result of environmental or
random causes. Thus, the genotype at a given locus may affect the probability of disease,
but not fully determine the outcome. The penetrance function f(G), specifying the
probability of disease for each genotype G, may also depend on nongenetic factors such
as age, sex, environment, and other genes. For example, the risk of breast cancer by ages
40, 55, and 80 is 37%, 66%, and 85% in a woman carrying a mutation at the BRCA1 locus
as compared with 0.4%, 3%, and 8% in a noncarrier (Easton et al., 1993, Cancer Surv.
18: 1995; Ford et al., 1994, Lancet 343: 692). In such cases, genetic mapping is
hampered by the fact that a predisposing allele may be present in some unaffected
individuals or absent in some affected individuals.
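
As a rough illustration of why incomplete penetrance and phenocopy hamper mapping, the following Python sketch encodes the age-dependent risks quoted above as a penetrance table and computes (a) the carrier-to-noncarrier relative risk and (b) the fraction of affected individuals who carry no predisposing allele. The carrier frequency passed to `phenocopy_fraction` is a purely hypothetical input; no such frequency is given in the present disclosure.

```python
# Cumulative risks quoted in the text, keyed by genotype class and age.
PENETRANCE = {
    "carrier": {40: 0.37, 55: 0.66, 80: 0.85},
    "noncarrier": {40: 0.004, 55: 0.03, 80: 0.08},
}


def penetrance(genotype: str, age: int) -> float:
    """Probability of disease by the given age for the given genotype class."""
    return PENETRANCE[genotype][age]


def relative_risk(age: int) -> float:
    """Ratio of carrier risk to noncarrier risk at the given age."""
    return penetrance("carrier", age) / penetrance("noncarrier", age)


def phenocopy_fraction(age: int, carrier_frequency: float) -> float:
    """Fraction of affected individuals who are noncarriers.

    carrier_frequency is a hypothetical population parameter.
    """
    affected_carriers = carrier_frequency * penetrance("carrier", age)
    affected_noncarriers = (1.0 - carrier_frequency) * penetrance("noncarrier", age)
    return affected_noncarriers / (affected_carriers + affected_noncarriers)


# With a hypothetical carrier frequency of 1%, most affected individuals
# by age 80 would be noncarriers, despite the high relative risk.
print(relative_risk(40), phenocopy_fraction(80, 0.01))
```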

In some embodiments a complex trait arises because any one of several genes may
result in identical phenotypes (genetic heterogeneity). In cases where there is genetic
heterogeneity, it may be difficult to determine whether two patients suffer from the same
disease for different genetic reasons until the genes are mapped. Examples of complex
diseases that arise due to genetic heterogeneity in humans include polycystic kidney
disease (Reeders et al., 1987, Human Genetics 76: 348), early-onset Alzheimer's disease
(George-Hyslop et al., 1990, Nature 347: 194), maturity-onset diabetes of the young
(Barbosa et al., 1976, Diabete Metab. 2: 160), hereditary nonpolyposis colon cancer
(Fishel et al., 1993, Cell 75: 1027), ataxia telangiectasia (Jaspers and Bootsma, 1982,
Proc. Natl. Acad. Sci. U.S.A. 79: 2641), obesity, nonalcoholic steatohepatitis (NASH)
(James & Day, 1998, J. Hepatol. 29: 495-501), nonalcoholic fatty liver (NAFL)
(Younossi et al., 2002, Hepatology 35, 746-752), and xeroderma pigmentosum (De
Weerd-Kastelein, Nat. New Biol. 238: 80). Genetic heterogeneity hampers genetic
mapping, because a chromosomal region may cosegregate with a disease in some families
but not in others.

In still other embodiments, a complex trait arises due to the phenomenon of
polygenic inheritance. Polygenic inheritance arises when a trait requires the simultaneous
presence of mutations in multiple genes. An example of polygenic inheritance in humans
is one form of retinitis pigmentosa, which requires the presence of heterozygous
mutations at the peripherin/RDS and ROM1 genes (Kajiwara et al., 1994, Science 264:
1604). The proteins encoded by RDS and ROM1 are thought to interact in
the photoreceptor outer pigment disc membranes. Polygenic inheritance complicates
genetic mapping, because no single locus is strictly required to produce a discrete trait or
a high value of a quantitative trait.

In yet other embodiments, a complex trait arises due to a high frequency of
disease-causing allele "D". A disease-causing allele will cause difficulties in mapping
even a simple trait if it occurs at high frequency in the population. That is because the
expected Mendelian inheritance pattern of disease will be confounded by the problem
that multiple independent copies of D may be segregating in the pedigree and that some
individuals may be homozygous for D, in which case one will not observe linkage
between D and a specific allele at a nearby genetic marker, because either of the two
homologous chromosomes could be passed to an affected offspring. Late-onset
Alzheimer's disease provides one example of the problems raised by high frequency
disease-causing alleles. Initial linkage studies found weak evidence of linkage to
chromosome 19q, but they were dismissed by many observers because the lod score
(logarithm of the likelihood ratio for linkage) remained relatively low, and it was
difficult to pinpoint the linkage with any precision (Pericak-Vance et al., 1991, Am. J.
Hum. Genet. 48: 1034). The confusion was finally resolved with the discovery that the
apolipoprotein E type 4 allele appears to be the major causative factor on chromosome 19.
The high frequency of the allele (about 16% in most populations) had interfered with the
traditional linkage analysis (Corder et al., 1993, Science 261: 921). High frequency of
disease-causing alleles becomes an even greater problem if genetic heterogeneity is
present.



5.16. ALGORITHMS FOR ELUCIDATING GENES THAT AFFECT A
COMPLEX TRAIT USING eQTL-cQTL OVERLAP

The present invention provides additional methods for associating a gene with a
complex trait. Fig. 19 discloses one such method. Referring to Fig. 19, the first step is to
assemble starting data (step 1902). The starting data includes the gene expression data
44, marker data 70, and genotype and pedigree data 68 as described in Section 5.1 in
conjunction with Fig. 1. Marker data 70 includes genome annotation information (e.g.,
where a gene is located within the genome). In some embodiments, rather than using
gene expression data 44, data such as protein expression levels in a plurality of organisms
under study is used. In some embodiments, gene expression data 44 is collected from
multiple different tissue types. In addition, in some embodiments, phenotypic data is
gathered in step 1902. The phenotypic data 95 differs from gene expression data 44 in
the sense that phenotypic data 95 includes quantitative measurements of traits other than
cellular constituent quantities (e.g., classical phenotypes). Thus in mice, for example,
phenotypic data 95 includes data for clinical traits such as subcutaneous fat pad mass,
perimetrial fat pad mass, omental fat pad mass, and adiposity. In plants, for example,
phenotypic data 95 includes data for clinical traits such as barren plants, brittle stalks,
yield, disease resistance, drydown, early growth, growing degree units (GDU), GDU to
physical maturity, GDU to shed, GDU to silk, harvest moisture, plant height, protein
rating, root lodging, seedling vigor, grain composition amino acids, and grain
composition carbohydrates. These clinical traits are defined in United States Patent
6,368,806 to Openshaw et al. Those of skill in the art will appreciate that there are a
large number of other possible clinical traits and all such traits are within the scope of the
present invention. Such clinical traits may include, but are not limited to, measurements
such as life span, presence or absence of a particular disease (e.g., a disease associated
with a complex trait), bone density, cholesterol level, obesity, blood sugar level, eye
color, blood type, and coordination.




Once starting data are assembled, gene expression data 44 is transformed into a
plurality of expression statistics (e.g., expression statistic set 304, Figs. 3A, 3B) for gene
G. Exemplary expression statistics include, but are not limited to, the mean log ratio, log
intensity, or background-corrected intensity for gene G. Each expression statistic (e.g.,
expression statistic 308, Fig. 3A) represents an expression value for a gene G. In one
embodiment, each expression value is a normalized expression level measurement for
gene G in an organism in a plurality of organisms under study. In one embodiment,
normalization module 72 (Fig. 1) is used to normalize the expression level measurement
for gene G. In some embodiments, each expression level measurement is determined by
measuring an amount of a cellular constituent encoded by the gene G in one or more cells
from an organism in the plurality of organisms. In one embodiment, the amount of the
cellular constituent comprises an abundance of an RNA present in one or more cells of
the organism. In one embodiment, the abundance of RNA is measured by a method
comprising contacting a gene transcript array with the RNA from one or more cells of the
organism, or with a nucleic acid derived from the RNA. The gene transcript array
comprises a positionally addressable surface with attached nucleic acids or nucleic acid
mimics. The nucleic acid mimics are capable of hybridizing with the RNA species or
with nucleic acid derived from the RNA species.

In embodiments where the expression level measurement is normalized, any
normalization routine may be used. Representative normalization routines include, but
are not limited to, Z-score of intensity, median intensity, log median intensity, Z-score
standard deviation log of intensity, Z-score mean absolute deviation of log intensity,
calibration DNA gene set, user normalization gene set, ratio median intensity correction,
and intensity background correction. Furthermore, combinations of normalization
routines may be run. Exemplary normalization routines in accordance with the present
invention are disclosed in more detail in Section 5.3, infra.
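
For illustration, two of the simpler routines named above might be sketched in Python as shown below. The exact definitions used by the present invention are those of Section 5.3; the formulas here (a Z-score taken over base-10 log intensities, and scaling by the median intensity) are only one plausible reading, and the function names are assumptions made for this example.

```python
import math
import statistics


def zscore_log_intensity(intensities):
    """Z-score of log10 intensity: (log x - mean) / sample standard deviation."""
    logs = [math.log10(x) for x in intensities]
    mean = statistics.mean(logs)
    sd = statistics.stdev(logs)
    return [(value - mean) / sd for value in logs]


def median_scale(intensities):
    """Divide every intensity by the median intensity of the set."""
    median = statistics.median(intensities)
    return [x / median for x in intensities]


# Hypothetical raw intensities for one array.
raw = [10.0, 100.0, 1000.0]
print(zscore_log_intensity(raw))
print(median_scale(raw))
```

Because each function takes and returns a plain list, combinations of routines can be expressed by composing the calls.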

In addition to the generation of expression statistics from gene expression data 44,
a genetic marker map 78 is generated from genetic markers 70 (Fig. 1; Fig. 19, step
1906). In one embodiment of the present invention, a genetic marker map is created
using marker map construction module 74 (Fig. 1). Further, in one embodiment,
genotype probability distributions for the organisms under study are computed. Genotype
probability distributions take into account information such as marker information of
parents, known genetic distances between markers, and estimated genetic distances
between the markers. Computation of genotype probability distributions generally
requires pedigree data. In some embodiments of the present invention, pedigree data is
not provided and genotype probability distributions are not computed.

Generally, a genetic marker map is constructed from a set of genetic markers 78
associated with a plurality of organisms 78 of the single species under study. The set of
genetic markers can comprise single nucleotide polymorphisms (SNPs), microsatellite
markers, restriction fragment length polymorphisms, short tandem repeats, DNA
methylation markers, sequence length polymorphisms, random amplified polymorphic
DNA, amplified fragment length polymorphisms, simple sequence repeats, or any
combination thereof. In some embodiments, genotype data is used to construct the
genetic marker map. Such genotype data comprises knowledge of which alleles, for each
marker in the set of genetic markers used to construct the map, are present in each
organism in the plurality of organisms under study. In some embodiments, the plurality
of organisms under study represents a segregating population and pedigree data is also
used to construct the marker map. Such pedigree data shows one or more relationships
between organisms in the plurality of organisms. In some embodiments, the plurality of
organisms under study comprises an F2 population and the one or more relationships
between organisms in the plurality of organisms indicates which organisms in the
plurality of organisms are members of the F2 population.

Once the expression data has been transformed into corresponding expression
statistics and genetic marker map 78 has been constructed, the data is transformed into a
structure that associates all marker, genotype and expression data for input into QTL
analysis software. This structure is stored in expression / genotype warehouse 76 (Fig. 1;
Fig. 19, step 1908). Fig. 3C illustrates an expression / genotype warehouse 76 that is used
in some embodiments where gene expression / cellular constituent data 44 was measured
from multiple tissue types.

A quantitative trait locus (QTL) analysis is performed using data corresponding to
a gene G as a quantitative trait (Fig. 19, step 1910). In some embodiments of the present
invention, step 1910 is performed by an embodiment of expression quantitative trait loci
(eQTL) identification module 2202 (Fig. 22), which is resident in memory 24 of
computer 20 in system 10 (Fig. 1). In one embodiment, this QTL analysis is performed
by genetic analysis module 80 (Fig. 1). In one example, the QTL analysis steps through a
genetic marker map 78 that represents the genome of the single species. Linkages to gene
G are tested at each step or location along the marker map. In such embodiments, each step
or location along the length of the marker map is at regularly defined intervals. In some
embodiments, these regularly defined intervals are defined in Morgans or, more typically,
centiMorgans (cM). In some embodiments, each regularly defined interval is less than
100 cM. In other embodiments, each regularly defined interval is less than 10 cM, less
than 5 cM, or less than 2.5 cM.

In the QTL analysis of step 1910, data corresponding to gene G is used as a
quantitative trait. More specifically, the quantitative trait used in the QTL analysis is an
expression statistic set, such as set 304 (Fig. 3A), that corresponds to gene G. That is, the
expression statistic set 304 comprises the expression statistic 308 for gene G from each
organism 306 in the population under study. Fig. 3B illustrates an exemplary expression
statistic set 304 in accordance with one embodiment of the present invention. Exemplary
expression statistic set 304 includes the expression level 308 of gene G from each
organism in a plurality of organisms. For example, consider the case where there are ten
organisms in the plurality of organisms, and each of the ten organisms expresses gene G.
In this case, expression statistic set 304 includes ten entries, each entry corresponding to a
different one of the ten organisms in the plurality of organisms. Further, each entry
represents the expression level of gene G in the organism represented by the entry. So,
entry "1" (308-G-1) corresponds to the expression level of gene G in organism 1, entry
"2" (308-G-2) corresponds to the expression level of gene G in organism 2, and so forth.
Expression statistic set 304 comprises a plurality of expression statistics 308 for gene G.
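
A minimal sketch of such an expression statistic set is given below in Python. It assumes, hypothetically, that each organism contributes one or more expression ratios for gene G and that the statistic is the mean log ratio (one of the exemplary statistics named above); the dictionary layout and the function names are assumptions made for illustration and are not the data structure of Fig. 3B.

```python
import math


def mean_log_ratio(ratios):
    """Mean of the base-10 logarithms of replicate expression ratios."""
    return sum(math.log10(r) for r in ratios) / len(ratios)


def expression_statistic_set(replicate_ratios_by_organism):
    """Build one entry per organism: organism id -> expression statistic.

    The result plays the role of expression statistic set 304 for one gene G.
    """
    return {
        organism: mean_log_ratio(ratios)
        for organism, ratios in replicate_ratios_by_organism.items()
    }


# Hypothetical replicate ratios for gene G in two organisms.
stat_set = expression_statistic_set({"organism_1": [10.0, 1000.0],
                                     "organism_2": [1.0]})
print(stat_set)
```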

In one embodiment of the present invention, the QTL analysis (Fig. 19, step 1910)
comprises: (i) testing for linkage between (a) the genotype of the plurality of organisms at
a position in the genome of the single species and (b) the plurality of expression statistics
for gene G (e.g., expression statistic set 304), (ii) advancing the position in the genome by
an amount, and (iii) repeating steps (i) and (ii) until the genome has been tested. In some
embodiments, the amount advanced in each instance of (ii) is less than about 100
centiMorgans, less than about 10 centiMorgans, less than about 5 centiMorgans, or less
than about 2.5 centiMorgans. In some embodiments, the testing comprises performing
linkage analysis (Section 5.13) or association analysis (Section 5.14) that generates a
statistical score for the position in the genome of the single species. As detailed below, in
some embodiments, the testing is linkage analysis and the statistical score is a logarithm
of the odds (lod) score. Thus, in some embodiments, an eQTL identified in processing
step 1910 is represented by a lod score that is greater than about 2.0, greater than about
3.0, greater than about 4.0, or greater than about 5.0.


In situations where pedigree data is not available, genotype data from each of the
organisms 46 (Fig. 1) for each marker in genetic marker map 78 may be compared to
each quantitative trait (expression statistic set 304) using allelic association analysis, as
described in Section 5.14, supra, in order to identify QTL that are linked to each
expression statistic set 304. In one form of association analysis, an affected population is
compared to a control population. In particular, haplotype or allelic frequencies in the
affected population are compared to haplotype or allelic frequencies in a control
population in order to determine whether particular haplotypes or alleles occur at
significantly higher frequency amongst affected compared with control samples.
Statistical tests such as a chi-square test can be used to determine whether there are
differences in allele or genotype distributions.
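
The chi-square comparison of allele distributions mentioned above can be sketched as follows, assuming a biallelic marker and a 2 x 2 table of allele counts in affected and control samples. The function name and the example counts are hypothetical; the statistic is the standard Pearson chi-square for a 2 x 2 table, with one degree of freedom.

```python
import math


def allelic_chi_square(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square (1 df) for allele counts in cases versus controls.

    case_a / case_b: counts of alleles A and B among affected samples.
    ctrl_a / ctrl_b: counts of alleles A and B among control samples.
    Returns (statistic, p_value).
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    denominator = ((case_a + case_b) * (ctrl_a + ctrl_b)
                   * (case_a + ctrl_a) * (case_b + ctrl_b))
    if denominator == 0:
        raise ValueError("a row or column of the table is empty")
    statistic = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2 / denominator
    # Upper tail of chi-square with one df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts: allele A is seen 60 times in 100 case alleles
# and 40 times in 100 control alleles.
print(allelic_chi_square(60, 40, 40, 60))
```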

In some embodiments, testing for linkage between a given position in the
chromosome and the expression statistic set 304 comprises correlating differences in the
expression levels found in the expression level statistic with differences in the genotype at
the given position using single marker tests (for example using t-tests, analysis of
variance, or simple linear regression statistics). See, e.g., Statistical Methods, Snedecor
and Cochran, Iowa State University Press, Ames, Iowa (1985). However, there are many
other methods for testing for linkage between expression statistic set 304 and a given
position in the chromosome. In particular, if expression statistic set 304 is treated as the
phenotype (in this case, a quantitative phenotype), then methods such as those disclosed
in Doerge, 2002, Mapping and analysis of quantitative trait loci in experimental
populations, Nature Reviews: Genetics 3:43-62, may be used. Concerning steps (i)
through (iii) above, if the genetic length of the genome is N cM and 1 cM steps are used,
then N different tests for linkage are performed on the given chromosome.
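
Steps (i) through (iii) can be sketched in Python as a simple scan, as shown below. This is an illustration under stated assumptions rather than the linkage analysis of Section 5.13: genotypes at each tested position are assumed to be coded numerically (e.g., 0, 1, 2 copies of an allele), the single marker test is a simple linear regression, and the regression fit is expressed on a lod-like scale as (n/2) log10(RSS0/RSS1), where RSS0 and RSS1 are the residual sums of squares without and with the genotype term. The function names and the default threshold are hypothetical.

```python
import math


def regression_lod(genotypes, trait):
    """Lod-scale score from regressing the trait on numeric genotype codes."""
    n = len(trait)
    if n != len(genotypes) or n < 3:
        raise ValueError("need matching genotype and trait vectors, n >= 3")
    mean_x = sum(genotypes) / n
    mean_y = sum(trait) / n
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    syy = sum((y - mean_y) ** 2 for y in trait)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotypes, trait))
    if sxx == 0 or syy == 0:
        return 0.0  # no genotype or trait variation: nothing to test
    rss1 = syy - sxy ** 2 / sxx  # residual sum of squares with genotype term
    if rss1 <= 0:
        return math.inf  # perfect fit
    return (n / 2) * math.log10(syy / rss1)


def genome_scan(genotypes_by_position, trait, threshold=3.0):
    """Test each position in map order; keep positions scoring >= threshold."""
    hits = []
    for position_cm in sorted(genotypes_by_position):  # step (ii): advance
        lod = regression_lod(genotypes_by_position[position_cm], trait)  # (i)
        if lod >= threshold:
            hits.append((position_cm, lod))
    return hits  # step (iii): the loop ends when every position is tested


# Hypothetical data: six organisms, three tested positions (in cM).
trait = [1.0, 1.1, 2.0, 2.1, 3.0, 3.1]
scan = {10.0: [0, 0, 1, 1, 2, 2], 50.0: [0, 1, 2, 0, 1, 2], 90.0: [1, 1, 1, 1, 1, 1]}
print(genome_scan(scan, trait))
```

The same loop could serve for a clinical trait by passing phenotypic values in place of expression statistics as the `trait` argument.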

In some embodiments, the QTL data produced from QTL analysis 1910 comprises
a logarithm of the odds score (lod) computed at each position tested in the genome under
study. A lod score is a statistical estimate of whether two loci are likely to lie near each
other on a chromosome and are therefore likely to be genetically linked. In the present
case, a lod score is a statistical estimate of whether a given position in the genome under
study is linked to the quantitative trait corresponding to a given gene. Lod scores are
further described in Section 5.4, supra. A lod score of three or more is generally taken to
indicate that two loci are genetically linked. The generation of lod scores requires
pedigree data. Accordingly, in embodiments in which a lod score is generated,
processing step 1910 is essentially a linkage analysis, as described in Section 5.13, with
the exception that the quantitative trait under study is derived from data, such as cellular
constituent expression statistics, rather than classical phenotypes such as eye color. In
situations where pedigree data is not available, genotype data from each of the organisms
46 (Fig. 1) for each marker in genetic marker map 78 may be compared to each
quantitative trait (e.g., expression statistic set 304) using association analysis, as
described in Section 5.14, supra, in order to identify QTL that are linked to the
quantitative trait.

In some embodiments, processing step 1910 yields a data structure that includes
all positions 86 (Fig. 1) in the genome of the organisms 46 that were tested for linkage to
the expression statistic set 304 (quantitative trait 84) in step 1910. In one embodiment,
this data structure is an entry in data structure 82 (Fig. 1). Positions 86 are obtained from
genetic marker map 78. For each position 86, genotype data 68 provides the genotype at
position 86 for each organism in the plurality of organisms under study. For each such
position 86 analyzed by QTL analysis 1910, a statistical measure (e.g., statistical score
88), such as the maximum lod score between the position and the expression statistic set,
is provided by processing step 1910. Thus, processing step 1910 yields all the positions
in the genome of the organism of interest that are linked to the expression statistic set 304
tested in step 1910. Such positions are referred to as the eQTL for the linked gene G
tested in step 1910.

In processing step 1912, a clinical quantitative trait loci (cQTL) that is linked to a
clinical trait T is identified using QTL analysis. In some embodiments of the present
invention, step 1912 is performed by an embodiment of clinical quantitative trait loci
(cQTL) identification module 2204 (Fig. 22). In some embodiments, a phenotypic
statistic set 2102 for the clinical trait T serves as the clinical trait used in the QTL
analysis. Fig. 21 illustrates exemplary phenotypic statistic sets 2102 that are stored as
phenotypic data 95 in memory 24 within system 10 (Fig. 1). In Fig. 21, each phenotypic
statistic set 2102 includes the phenotypic value for a different organism in a plurality of
organisms under study. As used herein, a phenotypic value is any form of measurement
of a phenotypic trait. For example, if the phenotypic trait is cholesterol level in the
organism, the phenotypic value may be milligrams of cholesterol per liter of blood.

In one embodiment, processing step 1912 comprises a classical form of QTL
analysis in which a phenotypic trait is quantified. In some embodiments, processing step
1912 employs a whole genome search of genetic markers using marker map 78. For each
such position 86 in the genome that is analyzed by QTL analysis 1912, processing step
1912 provides a statistical measure (e.g., statistical score 88), such as the maximum lod
score between the position and the phenotypic statistic set 2102. Thus, processing step
1912 yields all the positions in the genome of the organism of interest that are linked to
the phenotypic statistic set 2102 tested in step 1912. Such embodiments of processing step
1912 were first described by Lander and Botstein in Genetics 121, 174-179 (1989). They are
also described in International Application WO 90/04651, International Application WO
99/13107, Lander and Schork, Science 265, 2037-2048 (1994), and Doerge, Nature
Reviews Genetics 3, 43-62 (2002). In other embodiments of processing step 1912,
association analysis, as described in Section 5.14, is used rather than linkage analysis.
Association analysis does not require pedigree data.

In one embodiment of the present invention, the QTL analysis (Fig. 19, step 1912)
comprises: (i) testing for linkage between (a) the genotype of a plurality of organisms at a
position in the genome of a single species and (b) the phenotypic statistic set 2102 (e.g.,
plurality of phenotypic values), (ii) advancing the position in the genome by an amount,
and (iii) repeating steps (i) and (ii) until the genome has been tested. In some
embodiments, the amount advanced in each instance of (ii) is less than about 100
centiMorgans, less than about 10 centiMorgans, less than about 5 centiMorgans, or less
than about 2.5 centiMorgans. In some embodiments, the testing comprises performing
linkage analysis (Section 5.13) or association analysis (Section 5.14) that generates a
statistical score for the position in the genome of the single species. In some
embodiments, the testing is linkage analysis and the statistical score is a logarithm of the
odds (lod) score (Section 5.4). Thus, in some embodiments, a cQTL identified in
processing step 1912 is represented by a lod score that is greater than about 2.0, greater
than about 3.0, greater than about 4.0, or greater than about 5.0.

Processing step 1910 identifies any number of expression quantitative trait loci
(eQTL) for a gene G whereas processing step 1912 identifies any number of clinical
quantitative trait loci (cQTL) for a clinical trait T. In processing step 1914, the question is
asked whether an eQTL from processing step 1910 colocalizes with a cQTL from
processing step 1912 at the same point in the genome. In some embodiments of the
present invention, processing step 1914 is performed by an embodiment of determination
module 2206 (Fig. 22). In some embodiments, an eQTL and a cQTL are considered
colocalized if they fall within about 50 centiMorgans (cM) of each other within the
genome of the species under study. In some embodiments, an eQTL and cQTL are
considered colocalized if they fall within about 40 cM, about 30 cM, about 20 cM, about
15 cM, or about 10 cM of each other within the genome of the species under study. In
some embodiments, an eQTL and cQTL are considered colocalized if they fall within
about 8 cM, about 6 cM, about 4 cM, or about 2 cM of each other within the genome of
the species under study.

In some embodiments of the present invention, when an eQTL for gene G
colocalizes with a cQTL for a clinical trait T (1914-Yes), gene G is associated with the
clinical trait T (step 1920). If this condition is not satisfied (1914-No), then another gene
G in the genome of the species under study is selected and process control returns to step
1910 (Fig. 19). In other embodiments, the condition is imposed that the eQTL for gene G
colocalizes to the physical location of gene G in the genome (1916-Yes) before gene G is
associated with the clinical trait T (step 1920). In other words, the eQTL must
correspond to the physical location of gene G in the genome of the single species in order
for the gene to be linked to a clinical trait T. In some embodiments, an eQTL
corresponds to the physical location of gene G if the eQTL and G colocalize within about
5 cM, 4 cM, 3 cM, 2 cM, 1 cM, or less in the genome of the single species. In embodiments
where condition 1916 is imposed, when the condition is not satisfied (1916-No), another
gene G in the genome of the species under study is selected and process control returns to
step 1910.
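
The decision flow of steps 1914, 1916, and 1920 can be sketched in Python as follows. The sketch assumes that every locus (eQTL, cQTL, and the physical location of each gene) has already been expressed as a chromosome and a position in cM on the same marker map; the `Locus` type, the function names, and the default window sizes are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locus:
    chromosome: str
    position_cm: float


def colocalized(a: Locus, b: Locus, window_cm: float) -> bool:
    """True if both loci are on one chromosome and within window_cm."""
    return (a.chromosome == b.chromosome
            and abs(a.position_cm - b.position_cm) <= window_cm)


def genes_associated_with_trait(eqtls_by_gene, cqtls, gene_locations,
                                window_cm=10.0, cis_window_cm=None):
    """Return the genes whose eQTL passes condition 1914 (and 1916 if set).

    cis_window_cm=None skips condition 1916 (embodiments without it).
    """
    associated = []
    for gene, eqtls in eqtls_by_gene.items():
        for eqtl in eqtls:
            # Condition 1914: eQTL colocalizes with some cQTL for trait T.
            if not any(colocalized(eqtl, cqtl, window_cm) for cqtl in cqtls):
                continue
            # Condition 1916 (optional): eQTL lies at the gene's own location.
            if cis_window_cm is not None:
                location = gene_locations.get(gene)
                if location is None or not colocalized(eqtl, location, cis_window_cm):
                    continue
            associated.append(gene)  # step 1920
            break
    return associated


# Hypothetical loci for three genes and one clinical trait.
eqtls = {"G1": [Locus("1", 20.0)], "G2": [Locus("2", 55.0)], "G3": [Locus("1", 80.0)]}
cqtls = [Locus("1", 24.0), Locus("2", 50.0)]
locations = {"G1": Locus("1", 21.0), "G2": Locus("7", 10.0)}
print(genes_associated_with_trait(eqtls, cqtls, locations))
print(genes_associated_with_trait(eqtls, cqtls, locations, cis_window_cm=5.0))
```

Treating loci on different chromosomes as never colocalized is a design choice of this sketch; the text itself speaks only of distance within the genome.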

5.17. ALGORITHMS FOR FINDING AN ORTHOLOG IN A FIRST SPECIES TO
A GENE THAT AFFECTS A COMPLEX TRAIT IN A SECOND SPECIES

The methods of the present invention are used to identify gene-gene interactions,
gene-phenotype interactions, and biological pathways linked to complex traits in one
species (target species, first species) using data from another species (reference species,
second species). For example, in one embodiment of the present invention, genes
identified using the processing steps described in Section 5.1, above, and illustrated in
Fig. 2, are used to identify genes associated with a complex trait in a reference species.
Then, the genes that are orthologs of the genes identified in the reference species are
identified in the target species. In another example, genes identified using the processing
steps described in Section 5.16, above, and illustrated in Fig. 19, are used to identify
genes associated with a complex trait in the reference species. Then, the genes that are
the orthologs of the genes identified in the reference species are identified in a target
species. Orthologs represent the same protein from different species. That is, an ortholog
is a gene that is equivalent in two different species by sharing the same common ancestor.
Stated differently, an ortholog is a functional counterpart of a gene in another genome
that has arisen from speciation. See, for example, Fitch, 1970, Systematic Zoology
19:99-113.



5.17.1. FINDING ORTHOLOGS USING SEQUENCE-BASED METHODS

In one embodiment of the present invention, genes in a target species that are
homologous to genes in a reference species are identified using the steps outlined in Fig.
25. In step 2502, a gene G from a reference species that was identified using a
quantitative genetics method is selected. Typical reference species include, but are not
limited to, mouse, monkey, and dog. Typical target species include, but are not limited
to, humans. Once a gene G from the reference species has been selected, the remaining
processing steps in Fig. 25 are used to identify the gene in the target species that is the
ortholog to the gene G. In processing step 2504, a determination is made as to whether
the ortholog to gene G can simply be found in curated sequence databases through a
search tool such as LocusLink. See, for example, Pruitt and Maglott, 2001, Nucleic
Acids Res 29, 137-140 and Pruitt et al., 2000, Trends Genet 16, 44-47. If so (2504-Yes),
the process comes to an end with the ortholog to gene G in the target species identified
(2520).

If an automated search of curated databases fails (2504-No), then a manual
process is used to identify the ortholog of gene G in the target species (processing steps
2506 through 2516). In this process, a search tool such as Basic Local Alignment Search
Tool (BLAST) is used. Alternatively, a program such as PSI-BLAST, PHI-BLAST,
WU-BLAST-2, MEGABLAST, BlastN, or BlastP can be used. See Altschul et al., 1990, J.
Mol. Biol. 215, 403-410; Altschul et al., 1996, Methods in Enzymology 266, 460-480; and
Karlin et al., 1993, PNAS USA 90, 5873-5787.

In processing step 2506, BLAST (or an alternative program) is used to search
known nucleotide sequences in the target species using the nucleotide sequence of gene
G. The highest scoring gene sequence in the target species is denoted G'. Next, BLAST
(or an alternative program) is used to search protein sequences in the target species using
the amino acid translation of G (step 2508). The highest scoring sequence from the
search in 2508 is designated P'. If P' is not the protein product of G' (2510-No), it is
likely that the ortholog to gene G in the target species has not been identified by
processing steps 2506 and 2508. That is, the ortholog could be an unidentified gene
corresponding to P' or, alternatively, gene G'. Further still, the ortholog in the target
species could remain completely unidentified. In some embodiments of the present
invention, condition 2510 is satisfied if the protein product P' is in the top tier of the
search results in 2508. For example, in some embodiments, condition 2510 is satisfied
when the protein product P' appears anywhere in the top 10 search results in step 2508.
In other embodiments, condition 2510 is satisfied when the expectation value for the
protein product P' in the search of step 2508 is le'^ or less, le"^*^ or less, le'^^ or less, or 
le'^^ or less. 
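
The relaxed form of condition 2510 can be illustrated with a minimal sketch. It assumes the protein search of step 2508 has already been parsed into a list of hits; the Hit record, the function name condition_2510 and the max_evalue parameter are hypothetical names introduced only for this illustration, and no particular expectation-value cutoff is implied.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One row of a (hypothetical) parsed search result."""
    subject_id: str   # identifier of the matched target-species sequence
    evalue: float     # expectation value reported by the search tool


def condition_2510(protein_hits: List[Hit], product_of_g_prime: str,
                   top_n: int = 10,
                   max_evalue: Optional[float] = None) -> bool:
    """Return True when the protein product of G' is in the top tier of the
    protein search (step 2508).

    top_n=10 mirrors the "top 10 search results" embodiment. max_evalue is an
    optional, caller-supplied cutoff (an assumption of this sketch).
    """
    # Assumption: a smaller expectation value means a better hit.
    ranked = sorted(protein_hits, key=lambda h: h.evalue)
    for hit in ranked[:top_n]:
        if hit.subject_id == product_of_g_prime:
            return max_evalue is None or hit.evalue <= max_evalue
    return False
```

With top_n=1 the check reduces to the strict form of the condition, in which the single highest scoring protein P' must itself be the protein product of G'.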

When the condition 2510-No arises, steps 2506 and 2508 may be repeated using a
subset of gene G. Alternatively, steps 2506 and 2508 may be repeated using different
search settings, as disclosed in Altschul et al., 1990, J. Mol. Biol. 215, 403-410; Altschul
et al., 1996, Methods in Enzymology 266, 460-480; and Karlin et al., 1993, PNAS USA
90, 5873-5877. Further still, other methods for identifying orthologs may be used as
disclosed in Section 5.17.2, below.

If P' is the protein product of G' (2510-Yes), it is likely that the ortholog to gene
G in the target species has been identified by processing steps 2506 and 2508. However,
as confirmation, additional processing steps (2512 and 2514) may be performed. Steps
2512 and 2514 perform reverse searches in which P' and G' are used to respectively find
P and G in databases corresponding to the reference species. Thus, for example, in
processing step 2512, the nucleotide sequence G' is used to identify the sequence G in a
database of reference species gene sequences. It is expected that this search (e.g., a
BLAST search or some equivalent to a BLAST search) will identify G in the database of
reference species sequences with the highest score. Next, in processing step 2514, the
protein sequence P' is used to identify the protein P in a database of reference species
protein sequences. It is expected that this search (e.g., a BLAST search or some
equivalent to a BLAST search) will identify P in the database of reference species
sequences with the highest score. In processing step 2516, a determination is made as to
whether steps 2512 and 2514 identified P and G. If so (2516-Yes), the ortholog to gene
G in the target species has been identified as G' (2520). In some embodiments of the
present invention, P and G are considered identified when they appear in the top tier of
the respective search results of steps 2512 and 2514. For example, in some embodiments,
condition 2516 is satisfied (2516-Yes) when the protein product P appears anywhere in
the top 10 search results in step 2514. In other embodiments, condition 2516 is satisfied
when the protein product P or the gene G has an expectation


value of le"^ or less, le or less, le"^^ or less, or le"^^ or less in the respective search 
results of steps 2512 and 2514. When condition 2516 is not satisfied (2516-No), it is
possible that the ortholog to gene G has not been identified and other methods for
identifying orthologs need to be used to identify the ortholog.
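
The reverse-search confirmation of steps 2512 through 2516 can be sketched as follows. The two search callables are hypothetical stand-ins for a BLAST (or equivalent) search against the reference species databases; they are assumed to return subject identifiers ranked best first. The function name and the top_n parameter are likewise assumptions made for this illustration.

```python
from typing import Callable, List, Optional

# A search function takes a query identifier and returns subject identifiers
# ranked best first. In practice it might wrap a BLAST run; here it is a
# hypothetical stand-in so that the control flow can be shown on its own.
Search = Callable[[str], List[str]]


def confirm_ortholog(g: str, p: str, g_prime: str, p_prime: str,
                     search_ref_genes: Search,
                     search_ref_proteins: Search,
                     top_n: int = 1) -> Optional[str]:
    """Steps 2512-2516: reverse-search G' and P' against the reference species.

    Returns g_prime when both reverse searches recover G and P within the
    top_n results (2516-Yes), otherwise None (2516-No).
    """
    gene_hits = search_ref_genes(g_prime)[:top_n]        # step 2512
    protein_hits = search_ref_proteins(p_prime)[:top_n]  # step 2514
    if g in gene_hits and p in protein_hits:             # step 2516
        return g_prime                                   # ortholog is G' (2520)
    return None
```

Setting top_n to 10 corresponds to the embodiment in which P and G need only appear in the top tier of the respective search results.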

5.17.2. FINDING ORTHOLOGS USING NONSEQUENCE-BASED METHODS
Another approach for finding orthologs is disclosed in Fig. 26. The approach
illustrated in Fig. 26 is disclosed in greater detail in United States Patent Application
09/779,004 entitled "Functionating Genomes with Cross-Species Coregulation" filed
February 7, 2001, which is hereby incorporated by reference in its entirety. The approach
involves analysis of biological responses (e.g., response profiles) that are obtained or
provided from measurements of one or more aspects of the biological state of a reference
cell or organism in response to a set of perturbations. The perturbations may include, for
example, drug exposure, targeted mutations or targeted changes in levels of protein
activity or expression. Other exemplary conditions or perturbations include changes in
environmental conditions such as exposure to different conditions of temperature,
radiation, sunlight, oxygen or aeration, to name a few, as well as different nutritional
conditions such as growth or incubation of the reference cell or organism in the presence
or absence of particular nutrients (e.g., one or more particular amino acids and/or sugars).
Still further, exemplary perturbations also include exposure of the reference cell or
organism to one or more toxins including, but not limited to, exposure to pesticides
(including, e.g., fungicides or insecticides) or herbicides.

Particular aspects of the biological state of a cell, such as the transcriptional state,
the translational state or the activity state, are obtained or measured in response to the
plurality of perturbations. Preferably, the measurements are differential measurements of
the change in genes identified in response, e.g., to a drug at certain concentrations and
times of treatment. The collection of these measurements, which are optionally
graphically represented, are called herein the "perturbation response" or "drug response"
or, alternatively, the "response profile." In preferred embodiments of the invention, a
plurality of different response profiles are obtained or provided for a plurality of different
perturbations or for a plurality of cellular constituents.

In more detail, a first response profile is first obtained or provided (Fig. 26, step
2602) for a particular cellular constituent from a reference species under some particular
set of perturbations. Typically, the cellular constituent studied in step 2602 is a cellular
constituent corresponding to a gene G from a reference species that was identified using
quantitative genetics methods (e.g., a gene verified in processing step 222 of Fig. 2 or a
gene that has been associated with clinical trait T in processing step 1920 of Fig. 19). In
some embodiments, a cellular constituent is a gene or a gene product of the gene (e.g., a
protein). Thus, a cellular constituent may be, for example, the mRNA or cDNA that
corresponds to a particular gene. The set of perturbations for which a response profile is
obtained is referred to herein as the "condition set" and is denoted {A}.

In some embodiments, the number of different conditions or perturbations
contained in the condition set {A} is very large. In some embodiments, {A} includes at
least 10 different conditions or perturbations, in other embodiments, {A} includes at least
50 different conditions or perturbations, in still other embodiments, {A} includes at least
100 different conditions or perturbations, in yet other embodiments, {A} includes at least
500 different conditions or perturbations, and in still other embodiments, {A} includes at
least 1000 different conditions or perturbations. However, in order to practice the
methods of the invention most efficiently, the response profiles obtained for condition set
{A} are optionally evaluated (step 2604, Fig. 26) and a "perturbation subset," denoted
herein as {a}, is selected. Perturbation subset {a} consists of those perturbations or
conditions in the condition set (i.e., perturbation set) {A} for which the profiles of a gene
or, in more preferred embodiments, of a plurality of genes, in the reference species are
maximally informative (e.g., strongest and, preferably, most diverse).

For example, if several of the profiles obtained for the reference species are
closely correlated with each other, then typically only one of the conditions or
perturbations from this group is selected for further analysis according to the methods of
the present invention. Many techniques of analysis are known in the art that can be used
to assess the similarity and/or correlation between two or more different profiles. Such
techniques are disclosed in copending application serial number 09/779,004 entitled
"Functionating genomes with cross-species coregulation" which is hereby incorporated
by reference.
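
One way the optional selection of step 2604 might be carried out is sketched below. This is an illustration under stated assumptions, not the procedure of the incorporated application: each condition in {A} is represented by a profile over the same genes, "strongest" is taken to mean the largest sum of squared responses, and a condition is treated as redundant when its Pearson correlation with an already selected condition reaches a hypothetical cutoff max_corr.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(u: Sequence[float], v: Sequence[float]) -> float:
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0


def select_perturbation_subset(profiles: Dict[str, List[float]],
                               max_corr: float = 0.9) -> List[str]:
    """Greedy selection of a perturbation subset {a} from condition set {A}.

    Conditions are visited from strongest to weakest response, and a condition
    is kept only if it is not closely correlated with one already kept, so
    that one representative survives from each correlated group.
    """
    by_strength = sorted(profiles,
                         key=lambda name: sum(x * x for x in profiles[name]),
                         reverse=True)
    kept: List[str] = []
    for name in by_strength:
        if all(pearson(profiles[name], profiles[k]) < max_corr for k in kept):
            kept.append(name)
    return kept
```

In this sketch two doses of the same drug that produce nearly proportional profiles would collapse to the stronger of the two, while a condition with a distinct response pattern would be retained.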

In step 2606 (Fig. 26), a response profile is also obtained or provided for a
particular cellular constituent (e.g., a particular gene or gene product) in a sample of the
target species (e.g., a cell culture from the target species) under a particular set of
perturbations. The set of perturbations for which responses are obtained or provided for
cellular constituents of the target species preferably consists of the same perturbations
for which responses are obtained or provided for cellular constituents of the first cell or
organism representing the reference species. That is, the set of perturbations for which
responses are obtained or provided for cellular constituents of the target cell or species are
preferably members of the perturbation set {A}. More preferably, the set of perturbations
for which responses are obtained or provided for cellular constituents y of the target
species are members of the perturbation subset {a}. In fact, most preferably,
the set of perturbations for which a response profile is obtained or provided for cellular
constituents y of the target cell or species includes all of the perturbations that are
members of the optional perturbation subset {a}.

In step 2608, the response profiles obtained for a cellular constituent x in the
reference species and the cellular constituent y in the target species are used to evaluate
the co-regulation of x and y across a common set of conditions or perturbations, most
preferably across the perturbation subset {a}. For example, the similarity (e.g.,
correlation) of the response profiles of the genes or gene products x and y can be evaluated
by means of the equation:

[Equation for the correlation coefficient ρxy]

in which xi and yi denote respective changes in expression, abundance, activity levels or
amount of modification of the cellular constituents corresponding to genes x and y,
respectively, under the condition or perturbation i. If ρxy is particularly high, then x and y
are identified as functionally related and are thus determined to be candidate orthologs
(candidate functional orthologs). Preferably, the candidate orthologs identified according
to the methods of the invention have a correlation ρxy that is at least 0.5 (i.e., at least 50%).
More preferably, the candidate functional orthologs identified according to the methods
of the invention have a correlation that is at least 0.75 (i.e., at least 75%), at least 0.8
(i.e., at least 80%) or at least 0.85 (i.e., at least 85%). In fact, the candidate functional
orthologs identified according to the methods of the invention most preferably have a
correlation that is at least 0.9 (i.e., at least 90%).

Other forms of determining correlation between two datasets, besides the
correlation coefficient above, are well known in the art. Indeed, any statistical method for
determining the probability that two datasets are related may be used in accordance with
the methods of the present invention in order to identify orthologs. Correlation based on
ranks is also possible, where xi and yi are the ranks of the measurement in ascending or
descending numerical order. See e.g., Conover, Practical Nonparametric Statistics, 2nd
ed., Wiley, (1971). Shannon mutual information also can be used as a measure of
similarity. See e.g., Pierce, An Introduction To Information Theory: Symbols, Signals,
and Noise, Dover, (1980).
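
The two correlation measures discussed above can be sketched as follows. This is a minimal illustration that assumes the standard (mean-centered) Pearson coefficient for the value-based measure; the exact form of the correlation coefficient used in the methods of the invention may differ, for example by omitting the centering. The function names are hypothetical. The rank-based measure applies the same coefficient to ascending ranks, with tied values sharing their average rank.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Mean-centered correlation of two profiles over the same conditions i."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def ranks(values: Sequence[float]) -> List[float]:
    """Ascending ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[pos]]):
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def rank_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Correlation based on ranks: the coefficient applied to ranks(x), ranks(y)."""
    return pearson(ranks(x), ranks(y))
```

The rank-based form is less sensitive to a single extreme response: two profiles that rise together across the conditions receive a rank correlation of 1 even when one of them contains an outlying value that lowers the value-based coefficient.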

In processing step 2610, a determination is made as to whether there are any
additional cellular constituents y in the target species available for analysis. In some
embodiments of the present invention, it is desirable to analyze as many cellular
constituents y as possible in order to maximize the chances of finding the ortholog of
cellular constituent x. When an additional cellular constituent y is available (2610-Yes),
process control returns to steps 2606 and 2608 where the additional cellular constituent y is
evaluated. When no further cellular constituent y is available for analysis
(2610-No), process control passes to step 2612 where the cellular constituent y that had
the highest similarity to cellular constituent x in an instance of step 2608 is defined as the
ortholog to cellular constituent x.
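
The loop over candidate cellular constituents y can be sketched as follows. The similarity measure is passed in by the caller, so any of the measures discussed above could be supplied; the function name and the optional min_similarity cutoff are assumptions made for this illustration.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Profile = Sequence[float]


def best_functional_ortholog(
        x_profile: Profile,
        candidates: Dict[str, Profile],
        similarity: Callable[[Profile, Profile], float],
        min_similarity: Optional[float] = None) -> Optional[Tuple[str, float]]:
    """Evaluate every available y and return the one most similar to x.

    Returns (name, score) for the best candidate, or None when there are no
    candidates or the best score falls below the optional cutoff.
    """
    best: Optional[Tuple[str, float]] = None
    for name, y_profile in candidates.items():      # 2610-Yes: another y exists
        score = similarity(x_profile, y_profile)    # steps 2606 and 2608
        if best is None or score > best[1]:
            best = (name, score)
    # 2610-No: no further y is available, so step 2612 selects the best one.
    if best is not None and min_similarity is not None and best[1] < min_similarity:
        return None
    return best
```

Passing min_similarity=0.9, for example, would correspond to the most preferred correlation level described for candidate functional orthologs when a correlation coefficient is used as the similarity measure.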

5.17.3. EXEMPLARY APPLICATIONS
One aspect of the invention provides a method for associating a gene G in the
genome of a single first species with a clinical trait T exhibited by the single first species
and a single second species. In the method, a gene G' is found in the single second
species that is an ortholog of the gene G. In some embodiments, this is accomplished
using the techniques disclosed in Section 5.17.1 and/or Section 5.17.2. Further, an
expression quantitative trait loci (eQTL) is identified for gene G' using a first quantitative
trait loci (QTL) analysis. This aspect of the present invention uses techniques disclosed
in United States Patent Application serial number 60/400,522, entitled "Computer
systems and methods that use clinical and expression quantitative trait loci to associate
genes with traits," filed August 2, 2002, which is hereby incorporated by reference. The
first QTL analysis uses a plurality of expression statistics for gene G' as a quantitative
trait. Each expression statistic in the plurality of expression statistics represents an
expression value for the gene G' in an organism in a plurality of organisms of the single
second species. A clinical quantitative trait loci (cQTL) that is linked to the clinical trait
T is identified using a second QTL analysis. The second QTL analysis uses a plurality of
phenotypic values as a quantitative trait. Each phenotypic value in the plurality of
phenotypic values represents a phenotypic value for the clinical trait T in an organism in
the plurality of organisms of the single second species. Next, a determination is made as
to whether the eQTL and the cQTL colocalize to the same locus in the genome of the
single second species. For more disclosure on these techniques see, for example, Section
5.16. When the eQTL and the cQTL colocalize to the same locus, the gene G is
associated with the clinical trait T in the single first species.
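
The final colocalization determination can be illustrated with a minimal sketch. It assumes each QTL has been summarized by a chromosome and a peak position, and it treats two QTL as colocalized when they lie on the same chromosome within a caller-chosen distance. The QTL record, the centimorgan unit and the max_distance_cm window are hypothetical choices for this illustration; Section 5.16 should be consulted for the criteria actually contemplated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QTL:
    """Summary of one QTL (hypothetical record for this sketch)."""
    chromosome: str
    position_cm: float   # peak position along the chromosome, in centimorgans
    lod: float           # strength of linkage at the peak


def colocalize(eqtl: QTL, cqtl: QTL, max_distance_cm: float = 10.0) -> bool:
    """True when the eQTL for gene G' and the cQTL for trait T fall at the
    same locus, taken here to mean same chromosome and nearby peaks."""
    return (eqtl.chromosome == cqtl.chromosome
            and abs(eqtl.position_cm - cqtl.position_cm) <= max_distance_cm)
```

When colocalize returns True for the eQTL of G' and the cQTL of trait T in the second species, the method described above associates gene G with trait T in the first species.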

Another aspect of the present invention provides a computer program product for
use in conjunction with a computer system. The computer program product comprises a
computer readable storage medium and a computer program mechanism embedded
therein. The computer program mechanism associates a gene G in the genome of a single
first species with a clinical trait T exhibited by the single first species and a single second
species. The computer program mechanism comprises a specialized form of genetic
analysis module 80 (Fig. 1) which is illustrated in Fig. 54. The genetic analysis module
80 in accordance with this aspect of the invention includes an ortholog identification
module 5402 for finding a gene G' in the single second species that is an ortholog of gene
G. In some embodiments, this is accomplished using the techniques disclosed in Section
5.17.1 and/or Section 5.17.2. The genetic analysis module 80 in accordance with this
aspect of the invention further comprises an expression quantitative trait loci (eQTL)
identification module 5404 for identifying an expression quantitative trait loci (eQTL) for
the gene G' using a first quantitative trait loci (QTL) analysis. This aspect of the present
invention uses techniques disclosed in United States Patent Application serial number
60/400,522, entitled "Computer systems and methods that use clinical and expression
quantitative trait loci to associate genes with traits," filed August 2, 2002. The first QTL
analysis uses a plurality of expression statistics for the gene G' as a quantitative trait and
each expression statistic in the plurality of expression statistics represents an expression
value for the gene G' in an organism in the plurality of organisms of the single second
species. The QTL analysis module in accordance with this aspect of the invention further
comprises a clinical quantitative trait loci (cQTL) identification module 5406 for
identifying a clinical quantitative trait loci (cQTL) that is linked to the clinical trait T
using a second QTL analysis. The second QTL analysis uses a plurality of phenotypic
values as a quantitative trait. Each phenotypic value in the plurality of phenotypic values
represents a phenotypic value for the clinical trait T in an organism in the plurality of
organisms of the single second species. The QTL analysis module in accordance with
this aspect of the invention further comprises a determination module 5408 for
determining whether the eQTL and the cQTL colocalize to the same locus in the genome
of the single second species. For more disclosure on these techniques see, for example,
Section 5.16. When the eQTL and the cQTL colocalize to the same locus, the gene G is
associated with the clinical trait T in the single first species.

Another aspect of the present invention provides a computer system for
associating a gene G in the genome of a single first species with a clinical trait T
exhibited by the single first species and a single second species. The computer system
comprises a central processing unit and a memory coupled to the central processing unit.
The memory stores an ortholog identification module 5402 (Fig. 54), an expression
quantitative trait loci (eQTL) identification module 5404 (Fig. 54), a clinical quantitative
trait loci (cQTL) identification module 5406 (Fig. 54), and a determination module 5408
(Fig. 54). Ortholog identification module 5402 comprises instructions for finding a gene
G' in the single second species that is an ortholog of the gene G. Expression quantitative
trait loci (eQTL) identification module 5404 comprises instructions for identifying an
expression quantitative trait loci (eQTL) for the gene G' using a first quantitative trait loci
(QTL) analysis. The first QTL analysis uses a plurality of expression statistics for gene
G' as a quantitative trait. This aspect of the present invention uses techniques disclosed
in United States Patent Application serial number 60/400,522, entitled "Computer
systems and methods that use clinical and expression quantitative trait loci to associate
genes with traits," filed August 2, 2002. Each expression statistic in the plurality of
expression statistics represents an expression value for the gene G' in an organism in a
plurality of organisms of the single second species. Clinical quantitative trait loci (cQTL)
identification module 5406 comprises instructions for identifying a clinical quantitative
trait loci (cQTL) that is linked to the clinical trait T using a second QTL analysis. The
second QTL analysis uses a plurality of phenotypic values as a quantitative trait. Each
phenotypic value in the plurality of phenotypic values represents a phenotypic value for
the clinical trait T in an organism in the plurality of organisms of the single second
species. Determination module 5408 comprises instructions for determining whether the
eQTL and the cQTL colocalize to the same locus in the genome of the single second
species. When the eQTL and the cQTL colocalize to the same locus, the gene G is
associated with the clinical trait T.
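
The way the four stored modules cooperate can be sketched as a simple composition. The class and attribute names below are hypothetical, and each module is represented only by a callable supplied by the caller; the sketch shows the order in which the modules are invoked, not how any one of them is implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class GeneTraitAssociator:
    """Wires together stand-ins for modules 5402, 5404, 5406 and 5408."""
    find_ortholog: Callable[[str], str]        # ortholog identification (5402)
    find_eqtl: Callable[[str], Any]            # eQTL identification (5404)
    find_cqtl: Callable[[str], Any]            # cQTL identification (5406)
    colocalize: Callable[[Any, Any], bool]     # determination module (5408)

    def is_associated(self, gene_g: str, trait_t: str) -> bool:
        """True when gene G (first species) is associated with trait T."""
        g_prime = self.find_ortholog(gene_g)   # G' in the second species
        eqtl = self.find_eqtl(g_prime)         # first QTL analysis
        cqtl = self.find_cqtl(trait_t)         # second QTL analysis
        return self.colocalize(eqtl, cqtl)
```

Keeping each module behind a plain callable lets the sequence-based or the nonsequence-based ortholog search be substituted without changing the rest of the flow.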




Another aspect of the present invention provides a method for associating a gene
G in the genome of a single first species with a clinical trait T exhibited by the single first
species and a single second species. In the method, quantitative trait locus data from a
plurality of quantitative trait locus analyses is clustered to form a quantitative trait locus
interaction map. This aspect of the present invention uses techniques disclosed in United
States Patent Application serial number 60/381,437, entitled "Computer system and
method for identifying genes and determining pathways associated with traits," filed May
16, 2002, which is hereby incorporated by reference. Each quantitative trait locus
analysis in the plurality of quantitative trait locus analyses is performed for a gene in a
plurality of genes in the genome of the single second species using a genetic marker map
and a quantitative trait in order to produce the quantitative trait locus data. For each
quantitative trait locus analysis, the quantitative trait comprises an expression statistic for
the gene for which the quantitative trait locus analysis is performed, for each organism
in a plurality of organisms of the single second species. The genetic marker map is
constructed from a set of genetic markers associated with the plurality of organisms of the
second species. The quantitative trait locus interaction map is analyzed to identify a gene
G' associated with a trait. See, for example, Section 5.1. Then, the gene G in the single
first species that is the ortholog of the gene G' of the single second species is identified,
thereby associating a gene G in the genome of the single first species with a clinical trait
T exhibited by the single first species. See, for example, Sections 5.17.1 and 5.17.2. In
some embodiments, the method further comprises an additional step that is performed
prior to the clustering step. This additional step comprises performing each of the
quantitative trait locus analyses in the plurality of quantitative trait locus analyses.

Another aspect of the present invention provides a computer program product for
use in conjunction with a computer system. The computer program product comprises a
computer readable storage medium and a computer program mechanism embedded
therein. The computer program mechanism comprises a clustering module 92 (Fig. 1), an
analysis module (QTL analysis module) 80 (Fig. 1), and an ortholog identification module
5402 (Fig. 54). Clustering module 92 is used for clustering quantitative trait locus data
from a plurality of quantitative trait locus analyses to form a quantitative trait locus
interaction map. This aspect of the present invention uses techniques disclosed in United
States Patent Application serial number 60/381,437, entitled "Computer system and
method for identifying genes and determining pathways associated with traits," filed May
16, 2002. See also, for example, Section 5.1. Each quantitative trait locus analysis in the
plurality of quantitative trait locus analyses is performed for a gene in a plurality of genes
in the genome of a single second species using a genetic marker map and a quantitative
trait in order to produce the quantitative trait locus data. For each quantitative trait locus
analysis, the quantitative trait comprises an expression statistic for the gene for which the
quantitative trait locus analysis is performed, for each organism in a plurality of
organisms of the single second species. Further, the genetic marker map is constructed
from a set of genetic markers associated with the plurality of organisms of the single
second species. The analysis module is for analyzing the quantitative trait locus
interaction map to identify a gene G' associated with a trait exhibited by a single first
species and the single second species. Ortholog identification module 5402 is for finding
a gene G in the single first species that is an ortholog of the gene G' in the single second
species. See, for example, Sections 5.17.1 and 5.17.2.

Some embodiments of the present invention provide a computer system for
associating a gene G in the genome of a single first species with a clinical trait T
exhibited by the single first species and a single second species. The computer system
comprises a central processing unit and a memory coupled to the central processing unit.
The memory stores a clustering module 92 (Fig. 1), an analysis module (QTL analysis
module) 80 and an ortholog identification module 5402 (Fig. 54). Clustering module 92
is for clustering quantitative trait locus data from a plurality of quantitative trait locus
analyses to form a quantitative trait locus interaction map. This aspect of the present
invention uses techniques disclosed in United States Patent Application serial number
60/381,437, entitled "Computer system and method for identifying genes and determining
pathways associated with traits," filed May 16, 2002. Each quantitative trait locus
analysis in the plurality of quantitative trait locus analyses is performed for a gene in a
plurality of genes in the genome of the single second species using a genetic marker map
and a quantitative trait in order to produce the quantitative trait locus data. For each
quantitative trait locus analysis, the quantitative trait comprises an expression statistic for
the gene for which the quantitative trait locus analysis is performed, for each organism in
a plurality of organisms of the single second species. The genetic marker map is
constructed from a set of genetic markers associated with the plurality of organisms of the
single second species. Genetic analysis module 80 is for analyzing the quantitative trait
locus interaction map to identify a gene G' associated with a trait exhibited by the single
first species and the single second species. Ortholog identification module 5402 (Fig. 54)
is for finding a gene G in the single first species that is an ortholog of the gene G' of the
single second species.

Another aspect of the present invention provides a method for identifying a 
quantitative trait locus for a complex trait in a single first species. Some embodiments in 
accordance with this aspect of the present invention use techniques disclosed in United 
States Patent Application serial number 60/382,036, entitled "Computer systems and
methods for subdividing a complex disease into component diseases," filed May 20, 
2002, which is hereby incorporated by reference. 

The complex trait is exhibited by the single first species and a single second 
species. In the method, a plurality of organisms of the single second species are divided 
into a plurality of subpopulations using a classification scheme that classifies each 
organism in the plurality of organisms of the single second species into at least one of the
subpopulations. One way this can be accomplished is described in Section 5.17.4,
below. The classification scheme uses a plurality of cellular constituent measurements
from each organism of the single second species. Further, for at least one subpopulation
in the plurality of subpopulations, the method provides the step of performing quantitative
genetic analysis on the subpopulation in order to identify a quantitative trait locus for the
complex trait in the single second species. The method further provides the step of
finding the quantitative trait locus in the single first species that is the ortholog of the
quantitative trait locus of the single second species, thereby identifying the quantitative
trait locus for the complex trait in the single first species. See, for example, Sections
5.17.1 and 5.17.2.
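
The control flow of this method, dividing organisms into subpopulations and then analyzing each subpopulation separately, can be sketched as follows. The classification scheme and the quantitative genetic analysis are represented by caller-supplied callables, and all names are hypothetical; the sketch deliberately says nothing about how either one is implemented. An organism may be placed in more than one subpopulation, reflecting the "at least one" language above.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Mapping, Sequence, TypeVar

R = TypeVar("R")


def subdivide(measurements: Mapping[str, Sequence[float]],
              classify: Callable[[Sequence[float]], Iterable[str]]
              ) -> Dict[str, List[str]]:
    """Assign each organism to at least one subpopulation.

    measurements maps an organism identifier to its cellular constituent
    measurements; classify returns one or more subpopulation labels.
    """
    subpopulations: Dict[str, List[str]] = defaultdict(list)
    for organism, levels in measurements.items():
        labels = list(classify(levels))
        if not labels:
            raise ValueError(f"organism {organism!r} was not classified")
        for label in labels:
            subpopulations[label].append(organism)
    return dict(subpopulations)


def analyze_subpopulations(subpopulations: Mapping[str, List[str]],
                           analyze: Callable[[List[str]], R]) -> Dict[str, R]:
    """Run the supplied analysis once per subpopulation."""
    return {label: analyze(members)
            for label, members in subpopulations.items()}
```

Any quantitative trait locus found for a subpopulation would then be carried over to the first species using the ortholog identification techniques of Sections 5.17.1 and 5.17.2.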

Still another aspect of the present invention provides a computer program product 
for use in conjunction with a computer system. The computer program product comprises 
a computer readable storage medium and a computer program mechanism embedded 
therein. The computer program mechanism comprises a classification module 5410 (Fig.
54), a genetic analysis module (QTL analysis module) 80 (Fig. 1), and an ortholog
identification module 5402 (Fig. 54). Classification module 5410 is for dividing a
plurality of organisms of a single second species into a plurality of subpopulations using a
classification scheme that classifies each organism in the plurality of organisms of the
single second species into at least one of the subpopulations. This aspect of the present
invention uses techniques disclosed in United States Patent Application serial number
60/382,036, entitled "Computer systems and methods for subdividing a complex disease
into component diseases," filed May 20, 2002. The classification scheme uses a plurality
of cellular constituent measurements from each organism in the single second species.
The genetic analysis module is used, for at least one subpopulation in the plurality of
subpopulations, to perform quantitative genetic analysis on the subpopulation in order to
identify a quantitative trait locus for a complex trait that is exhibited by the single second
species and a single first species. Ortholog identification module 5402 is used for finding
the quantitative trait locus in the single first species that is the ortholog of the quantitative
trait locus of the single second species. See, for example, Sections 5.17.1 and 5.17.2.

Another aspect of the present invention provides a computer system for
identifying a quantitative trait locus for a complex trait in a single first species. The
complex trait is exhibited by the single first species and a single second species. The
computer system comprises a central processing unit and a memory coupled to the central
processing unit. The memory is used for storing a classification module 5410 (Fig. 54), a
genetic analysis module 80 (Fig. 1), and an ortholog identification module 5402 (Fig. 54).
Classification module 5410 includes instructions for dividing a plurality of organisms of a
single second species into a plurality of subpopulations using a classification scheme that
classifies each organism in the plurality of organisms of the single second species into at
least one of the subpopulations. This aspect of the present invention uses techniques
disclosed in United States Patent Application serial number 60/382,036, entitled
"Computer systems and methods for subdividing a complex disease into component
diseases," filed May 20, 2002. The classification scheme uses a plurality of cellular
constituent measurements from each organism in the single second species. Genetic
analysis module 80 includes instructions that, for at least one subpopulation in the
plurality of subpopulations, perform quantitative genetic analysis on the subpopulation
in order to identify the quantitative trait locus for the complex trait. Ortholog
identification module 5402 comprises instructions for finding the quantitative trait locus
in the single first species that is the ortholog of the quantitative trait locus in the single
second species. See, for example, Sections 5.17.1 and 5.17.2.

5.17.4. SUBDIVIDING SCHEMES

This section describes an approach to subdividing a population into
subpopulations.




Step 7102. In step 7102 (Fig. 71A), a trait is selected for study in a species. In
some embodiments, the trait is a complex trait. The species can be a plant, animal,
human, or bacterium. In some embodiments, the species is human, cat, dog, mouse, rat,
monkey, pig, Drosophila, or corn. In some embodiments, a plurality of organisms
representing the species are studied. The number of organisms in the species can be any
number. In some embodiments, the plurality of organisms studied is between 5 and 100,
between 50 and 200, between 100 and 500, or more than 500.

In some embodiments, a portion of the organisms under study are subjected to a
perturbation that affects the trait. The perturbation can be environmental or genetic.
Examples of environmental perturbations include, but are not limited to, exposure of an
organism to a test compound, an allergen, pain, hot or cold temperatures. Additional
examples of environmental perturbations include diet (e.g., a high fat diet or low fat diet),
sleep deprivation, isolation, and quantifying natural environmental influences (e.g.,
smoking, diet, exercise). Examples of genetic perturbations include, but are not limited
to, the use of gene knockouts, introduction of an inhibitor of a predetermined gene or
gene product, N-Ethyl-N-nitrosourea (ENU) mutagenesis, siRNA knockdown of a gene,
or quantifying a trait exhibited by a plurality of organisms of a species.

The perturbation optionally used in step 7102 is selected because of some
relationship between the perturbation and the trait. For example, the perturbation could
be the siRNA knockdown of a gene that is thought to influence the trait under study.
Examples of traits that can be studied in the systems and methods of the present invention
are disclosed in Section 5.15.

Step 7104. In step 7104 (Fig. 71A), the levels of cellular constituents are
measured from the plurality of organisms 46 in order to derive gene expression / cellular
constituent data 44. The identity of the tissue from which such measurements are made
will depend on what is known about the trait under study. In some embodiments, cellular
constituent measurements are made from several different tissues.

Generally, the plurality of organisms 46 exhibit a genetic variance with respect to
the trait. In some embodiments, the trait is quantifiable. For example, in instances where
the trait is a disease, the trait can be quantified in a binary form (e.g., "1" if the organism
has contracted the disease and "0" if the organism has not contracted the disease). In
some embodiments, the trait can be quantified as a spectrum of values and the plurality of
organisms 46 will represent several different values in such a spectrum. In some
embodiments, the plurality of organisms 46 comprise an untreated (e.g., unexposed, wild
type, etc.) population and a treated population (e.g., exposed, genetically altered, etc.). In
some embodiments, for example, the untreated population is not subjected to a
perturbation whereas the treated population is subjected to a perturbation. In some
embodiments, the secondary tissue that is measured in step 7104 is blood, white adipose
tissue, or some other tissue that is easily obtained from organisms 46.

In varying embodiments, the levels of between 5 cellular constituents and 100
cellular constituents, between 50 cellular constituents and 100 cellular constituents,
between 300 and 1000 cellular constituents, between 800 and 5000 cellular constituents,
between 4000 and 15,000 cellular constituents, between 10,000 and 40,000 cellular
constituents, or more than 40,000 cellular constituents are measured.

In one embodiment, gene expression / cellular constituent data 44 comprises the
processed microarray images for each individual (organism) 46 in a population under
study. In some embodiments, such data comprises, for each individual 46, intensity
information 50 for each gene / cellular constituent 48 represented on the microarray. In
some embodiments, cellular constituent data 44 is, in fact, protein expression levels for
various proteins in a particular tissue in organisms 46 under study.

In one aspect of the present invention, cellular constituent levels are determined in
step 7104 by measuring an amount of the cellular constituent in a predetermined tissue of
the organism. As used herein, the term "cellular constituent" comprises individual genes,
proteins, mRNA expressing genes, metabolites and/or any other cellular components that
can affect the trait under study. The level of a cellular constituent can be measured in a
wide variety of methods. Cellular constituent levels, for example, can be amounts or
concentrations in the secondary tissue, their activities, their states of modification (e.g.,
phosphorylation), or other measurements relevant to the trait under study.

In one embodiment, step 7104 comprises measuring the transcriptional state of
cellular constituents 48 in tissues of organisms 46. The transcriptional state includes the
identities and abundances of the constituent RNA species, especially mRNAs, in the
tissue. In this case, the cellular constituents are RNA, cRNA, cDNA, or the like. The
transcriptional state of the cellular constituents can be measured by techniques of
hybridization to arrays of nucleic acid or nucleic acid mimic probes, or by other gene
expression technologies. Transcript arrays are discussed in Section 5.8.




In another embodiment, step 7104 comprises measuring the translational state of
cellular constituents 48. In this case, the cellular constituents are proteins. The
translational state includes the identities and abundances of the proteins in the organisms
46. In one embodiment, whole genome monitoring of protein (i.e., the "proteome,"
Goffeau et al., 1996, Science 274, p. 546) can be carried out by constructing a microarray
in which binding sites comprise immobilized, preferably monoclonal, antibodies specific
to a plurality of protein species encoded by the secondary tissue. Preferably, antibodies
are present for a substantial fraction of the encoded proteins. Methods for making
monoclonal antibodies are well known. See, for example, Harlow and Lane, 1998,
Antibodies: A Laboratory Manual, Cold Spring Harbor, N.Y. In one embodiment,
monoclonal antibodies are raised against synthetic peptide fragments designed based on
genomic sequences. With such an antibody array, proteins from the organism are
contacted with the array and their binding is assayed with assays known in the art. In
some embodiments, antibody arrays for high-throughput screening of antibody-antigen
interactions are used. See, for example, Wildt et al., Nature Biotechnology 18, p. 989.

Alternatively, large scale quantitative protein expression analysis can be
performed using radioactive (e.g., Gygi et al., 1999, Mol. Cell. Biol. 19, p. 1720) and/or
stable isotope metabolic labeling (e.g., Oda et al., Proc. Natl. Acad. Sci. USA 96, p.
6591) followed by two-dimensional (2D) gel separation and quantitative analysis of
separated proteins by scintillation counting or mass spectrometry. Two-dimensional gel
electrophoresis is well-known in the art and typically involves focusing along a first
dimension followed by SDS-PAGE electrophoresis along a second dimension. See, e.g.,
Hames et al., 1990, Gel Electrophoresis of Proteins: A Practical Approach, IRL Press,
New York; Shevchenko et al., 1996, Proc. Nat'l Acad. Sci. USA 93, p. 1440; Sagliocco et
al., 1996, Yeast 12, p. 1519; Lander, 1996, Science 274, p. 536; and Naaby-Hansen et
al., 2001, TRENDS in Pharmacological Science 22, p. 376. Electropherograms can be
analyzed by numerous techniques, including mass spectrometric techniques, western
blotting and immunoblot analysis using polyclonal and monoclonal antibodies, and
internal and N-terminal micro-sequencing. See, for example, Gygi et al., 1999, Nature
Biotechnology 17, p. 994. In some embodiments, fluorescence two-dimensional
difference gel electrophoresis (DIGE) is used. See, for example, Beaumont et al., Life
Science News 7, 2001. In some embodiments, quantities of proteins in the secondary
tissue of organisms 46 are determined using isotope-coded affinity tags (ICATs) followed
by tandem mass spectrometry. See, for example, Gygi et al., 1999, Nature Biotech 17, p.
994. Using such techniques, it is possible to identify a substantial fraction of the proteins
expressed in a predetermined secondary tissue in organisms 46.

In other embodiments, step 7104 comprises measuring the activity or
post-translational modifications of the cellular constituents in the plurality of organisms 46.
See, for example, Zhu and Snyder, Curr. Opin. Chem. Biol. 5, p. 40; Martzen et al., 1999,
Science 286, p. 1153; Zhu et al., 2000, Nature Genet. 26, p. 283; and Caveman, 2000, J.
Cell. Sci. 113, p. 3543. In some embodiments, measurement of the activity of the cellular
constituents is facilitated using techniques such as protein microarrays. See, for example,
MacBeath and Schreiber, 2000, Science 289, p. 1760; and Zhu et al., 2001, Science 293,
p. 2101. In some embodiments, post-translational modifications or other aspects of the
state of cellular constituents are analyzed using mass spectrometry. See, for example,
Aebersold and Goodlett, 2001, Chem. Rev. 101, p. 269; Petricoin III, 2002, The Lancet
359, p. 572.

In some embodiments, the proteome of organisms 46 under study is analyzed in
step 7104. The analysis of the proteome (e.g., the quantification of all proteins and the
determination of their post-translational modifications) typically involves the use of
high-throughput protein analysis methods such as microarray technology. See, for example,
Templin et al., 2002, TRENDS in Biotechnology 20, p. 160; Albala and
Humphrey-Smith, 1999, Curr. Opin. Mol. Ther. 1, p. 680; Cahill, 2000, Proteomics: A Trends Guide,
p. 47-51; Emili and Cagney, 2000, Nat. Biotechnol. 18, p. 393; and Mitchell, Nature
Biotechnology 20, p. 225.

In still other embodiments, "mixed" aspects of the amounts of cellular constituents
are measured in step 7104. In one example, the amounts or concentrations of one set of
cellular constituents in the organisms 46 under study are combined with measurements of
the activities of certain other cellular constituents in such organisms.

In some embodiments, different allelic forms of a cellular constituent in a given
organism are detected and measured in step 7104. For example, in a diploid organism,
there are two copies of any given gene, one descending from the "father" and the other
from the "mother." In some instances, it is possible that each copy of the given gene is
expressed at different levels. This is of significant interest since this type of allelic
differential expression could associate with the trait under study, particularly in instances
where the trait under study is complex.


Step 7106. Once gene expression / cellular constituent data 44 has been obtained,
the data is transformed (Fig. 71A, step 7106) into expression statistics. In some
embodiments, cellular constituent data 44 (Fig. 1) comprises transcriptional data,
translational data, activity data, and/or metabolite abundances for a plurality of cellular
constituents. In one embodiment, the plurality of cellular constituents comprises at least
five cellular constituents. In another embodiment, the plurality of cellular constituents
comprises at least one hundred cellular constituents, at least one thousand cellular
constituents, at least twenty thousand cellular constituents, or more than thirty thousand
cellular constituents.

The expression statistics commonly used as quantitative traits in the analyses in
one embodiment of the present invention include, but are not limited to, the mean log
ratio, log intensity, and background-corrected intensity derived from transcriptional data.
In other embodiments, other types of expression statistics are used as quantitative traits.

In one embodiment, this transformation (Fig. 71A, step 7106) is performed using
a normalization module (not shown). In such embodiments, the expression level of each of
a plurality of genes in each organism under study is normalized. Any normalization
routine can be used by the normalization module. Representative normalization routines
include, but are not limited to, Z-score of intensity, median intensity, log median
intensity, Z-score standard deviation log of intensity, Z-score mean absolute deviation of
log intensity, calibration DNA gene set, user normalization gene set, ratio median intensity
correction, and intensity background correction. Furthermore, combinations of
normalization routines can be run. Exemplary normalization routines in accordance with
the present invention are disclosed in more detail in Section 5.3.
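
By way of non-limiting illustration, one of the listed routines can be sketched in Python as
follows. The sketch assumes that "Z-score standard deviation log of intensity" means taking
the base-10 logarithm of each intensity and standardizing by the sample mean and sample
standard deviation; the function name and the toy intensities are hypothetical and are not
taken from Section 5.3.

```python
import math
from statistics import mean, stdev

def zscore_log_intensity(intensities):
    """Return Z-scores of log10 intensity for one organism's measurements.

    Assumption: all intensities are positive and at least two distinct
    values are present (otherwise the standard deviation is zero).
    """
    logs = [math.log10(x) for x in intensities]
    mu = mean(logs)
    sigma = stdev(logs)  # sample standard deviation
    return [(value - mu) / sigma for value in logs]

# Hypothetical intensities 50 for three cellular constituents 48.
raw = [10.0, 100.0, 1000.0]
z = zscore_log_intensity(raw)
```

Standardizing on the log scale keeps a few very bright spots from dominating the statistic;
other listed routines would replace the mean and standard deviation with other estimators.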

Step 7150. In the preceding steps, a trait is identified, cellular constituent level
data is measured, and the cellular constituent data is transformed into expression
statistics. In step 7150 (Fig. 71A), one or more phenotypes are measured for each
organism 46 in the population under study. Fig. 72 summarizes the data that is measured
as a result of steps 7102-7106 and 7150. For each organism 46 in the population under
study there are at least two classes of data collected. The first class of data collected is
phenotypic information 7201. Phenotypic information 7201 can be anything related to
the trait under study. For example, phenotypic information 7201 can be a binary event,
such as whether or not a particular organism exhibits the phenotype (+/-). The
phenotypic information can be some quantity, such as the results of an obesity
measurement for the respective organism 46. As illustrated in Fig. 72, there can be more
than one phenotypic measurement made per organism 46.

The second class of data collected for each organism 46 in the population under
study is cellular constituent levels 50 (e.g., amounts, abundances) for a plurality of
cellular constituents (steps 1204-1206, Fig. 71A). Although not illustrated in Fig. 71,
there can be several sets of cellular constituent measurements for each organism. Each of
these sets could represent cellular constituent measurements measured in the respective
organism 46 after the organism has been subjected to a perturbation that affects the trait
under study. Representative perturbations include, but are not limited to, exposing the
organism 46 to an amount of a compound. Further, each set of cellular constituents for a
respective organism 46 could represent measurements taken from a different tissue in the
organisms. For example, one set of cellular constituent measurements could be from a
blood sample taken from the respective organism while another set of cellular constituent
measurements could be from fat tissue from the respective organism.

Step 7152. In step 7152 (Fig. 71A), the phenotypic data 7201 (Fig. 72) collected
in step 7150 is used to divide the population into phenotypic groups 7310 (Fig. 73). The
method by which step 7152 is accomplished is dependent upon the type of phenotypic
data measured in step 7150. For example, in the case where the only phenotypic data is
whether or not the organism 46 exhibits a particular trait, step 7152 is straightforward.
Those organisms 46 that exhibit the trait are placed in a first group and those organisms
46 that do not exhibit the trait are placed in a second group. A slightly more complex
example is where amounts 7201 represent gradations of a quantified trait exhibited by
each organism 46. For example, in the case where the trait is obesity, each amount 7201
can correspond to an obesity index (e.g., body mass index, etc.) for the respective
organism 46. In this second example, organisms 46 can be binned into phenotypic groups
7310 as a function of the obesity index.
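
The binning described above can be illustrated, by way of example only, with the following
Python sketch. The cutoff values, the organism labels, and the function name are
hypothetical; the text does not prescribe particular cutoffs.

```python
def bin_by_index(index_by_organism, cutoffs):
    """Bin organisms into phenotypic groups by a quantified trait.

    The group number of an organism is the count of cutoffs that its
    index meets or exceeds, so two cutoffs give groups 0, 1, and 2.
    """
    groups = {}
    for organism, value in index_by_organism.items():
        group = sum(value >= cutoff for cutoff in cutoffs)
        groups.setdefault(group, []).append(organism)
    return groups

# Hypothetical obesity index (amount 7201) for three organisms 46.
obesity_index = {"46-1": 18.0, "46-2": 27.0, "46-3": 35.0}
groups = bin_by_index(obesity_index, cutoffs=[25.0, 30.0])
```

With a single binary phenotype, the same function reduces to the two-group case by passing
one cutoff.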

In yet another example in accordance with the invention, several phenotypic
measurements can be collected for a given organism 46. In such embodiments, each
phenotypic measurement 7201 for a respective organism 46 can be treated as an element of
a phenotypic vector corresponding to the respective organism 46. These phenotypic
vectors can then be clustered using, for example, any of the clustering techniques
disclosed in Section 5.5 in order to derive phenotypic groups 7310. To illustrate, in one
example, the organisms 46 are human and measurements 7201 are derived from a
standard 12-lead electrocardiogram graph (ECG). The standard 12-lead ECG is a
representation of the heart's electrical activity recorded from electrodes on the body
surface. The ECG provides a wealth of phenotypic data including, but not limited to,
heart rate, heart rhythm, conduction, wave form description, and ECG interpretation
(typically a binary event, e.g., normal, abnormal). Each of these different phenotypes
(e.g., heart rate, heart rhythm) can be quantified as elements in a phenotypic vector. Further,
some elements of the phenotypic vector (e.g., ECG interpretation) can be given more
weight during clustering. For instance, the ECG measurements can be augmented by
additional phenotypes such as blood cholesterol level, blood triglyceride level, sex, or age
in order to derive a phenotypic vector for each respective organism 46. Once suitable
phenotypic vectors are constructed, they can be clustered using any of the clustering
algorithms in Section 5.5 in order to identify phenotypic groups 7310.
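
One way to give certain elements of a phenotypic vector more weight during clustering is
to use a weighted distance between vectors. The following Python sketch is offered as an
assumption about how such weighting could be done, not as the method of Section 5.5; the
function name and the example weights are hypothetical.

```python
import math

def weighted_distance(u, v, weights):
    """Weighted Euclidean distance between two phenotypic vectors.

    A larger weight makes a difference in that element count for more
    when vectors are compared; a zero weight ignores the element.
    """
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(u, v, weights)))

# Hypothetical two-element vectors, compared with equal weights and
# with all of the weight placed on the first element.
equal = weighted_distance([0.0, 0.0], [3.0, 4.0], [1.0, 1.0])
first_only = weighted_distance([0.0, 0.0], [3.0, 4.0], [4.0, 0.0])
```

In practice the elements would usually be placed on comparable scales before weighting,
since heart rate and a binary ECG interpretation have very different ranges.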

In some embodiments, step 7152 is an iterative process in which various
phenotypic vectors are constructed and clustered until a form of phenotypic vector that
produces clear, distinct groups is identified. Of particular interest are those phenotypic
vectors that are capable of producing phenotypic groups 7310 that are uniquely
characterized by certain phenotypes (e.g., an abnormal ECG / high cholesterol subgroup, a
normal ECG / low cholesterol subgroup).

Using the example presented above, phenotypic vectors that can be iteratively
tested include a vector that has ECG data only, one that has blood measurements only,
one that is a combination of the ECG data and blood measurements, one that has only
select ECG data, one that has weighted ECG data, and so forth. Furthermore, optimal
phenotypic vectors can be identified using search techniques such as stochastic search
techniques (e.g., simulated annealing, genetic algorithms). See, for example, Duda et al.,
2001, Pattern Classification, second edition, John Wiley & Sons, New York.

Step 7154. In step 7154, the phenotypic extremes within the population are
identified. For example, in one case, the trait of interest is obesity. In step 7154, very
obese and very skinny organisms 46 can be selected as the phenotypic extremes. In one
embodiment of the present invention, a phenotypic extreme is defined as the top or lowest
40th, 30th, 20th, or 10th percentile of the population with respect to a given phenotype
exhibited by the population.
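
The percentile-based selection of phenotypic extremes can be sketched in Python as
follows. This is an illustration only: the function name, the organism labels, and the
rounding rule for the number of organisms in each tail are assumptions.

```python
def phenotypic_extremes(phenotype, fraction):
    """Return (lowest, highest) organisms for one phenotype.

    phenotype maps an organism label to its measured amount; fraction
    is the tail size, e.g. 0.2 for the lowest and top 20th percentile.
    """
    ranked = sorted(phenotype, key=phenotype.get)
    k = max(1, int(len(ranked) * fraction))  # at least one per tail
    return ranked[:k], ranked[-k:]

# Ten hypothetical organisms 46 with amounts 1.0 through 10.0.
amounts = {f"46-{i}": float(i) for i in range(1, 11)}
lowest, highest = phenotypic_extremes(amounts, 0.2)
```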

Step 7156. In step 7156, a plurality of cellular constituents (levels 50, Fig. 72) for
the species represented by organisms 46 are filtered. Only levels 50 measured for
phenotypically extreme organisms 46 selected in step 7154 are used in this filtering. To
illustrate using Fig. 73, consider the case in which organism 46-1 and organism 46-N
represent phenotypic extremes with respect to some phenotype whereas organism 46-2
does not. Then, in this instance, levels 50 measured for organisms 46-1 and 46-N will be
considered in the filtering whereas levels 50 measured for organism 46-2 will not be
considered in the filtering.

In some embodiments, cellular constituent levels 50 (measured in phenotypically
extreme organisms) for a given cellular constituent 48 are subjected to a t-test (or a
multivariate test) to determine whether the given cellular constituent 48 can discriminate
between the phenotypic groups 7310 (Fig. 73) that were identified in step 7152, above. A
cellular constituent 48 will discriminate between phenotypic groups when the cellular
constituent is found at characteristically different levels in each of the phenotypic groups
7310. For example, in the case where there are two phenotypic groups 7310, a cellular
constituent will discriminate between the two groups 7310 when levels 50 of the cellular
constituent (measured in phenotypically extreme organisms) are found at a first level in
the first phenotypic group and are found at a second level in the second phenotypic group,
where the first and second level are distinctly different.

In preferred embodiments, each cellular constituent is subjected to a t-test without
consideration of the other cellular constituents in the organism. However, in other
embodiments, groups of cellular constituents are compared in a multivariate analysis in
step 7156 in order to identify those cellular constituents that discriminate between
phenotypic groups 7310.
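
The per-constituent test can be illustrated with the Python sketch below. The text names a
t-test without specifying its form; the sketch assumes the unequal-variance (Welch)
statistic and, for brevity, compares the statistic against a hypothetical cutoff rather than
computing a p-value. All names and data are illustrative.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two samples of levels 50.

    Assumption: each sample has at least two values and the two sample
    variances are not both zero.
    """
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

def discriminating(levels_a, levels_b, cutoff=2.0):
    """List constituents whose |t| between the two groups exceeds cutoff.

    levels_a and levels_b map a constituent label to the levels measured
    in the extreme organisms of each phenotypic group.
    """
    return [c for c in levels_a
            if abs(welch_t(levels_a[c], levels_b[c])) > cutoff]

# Hypothetical levels for two constituents 48 in two groups 7310.
group_a = {"48-1": [1.0, 2.0, 3.0], "48-2": [5.0, 6.0, 7.0]}
group_b = {"48-1": [4.0, 5.0, 6.0], "48-2": [5.5, 6.5, 6.0]}
selected = discriminating(group_a, group_b)
```

Each constituent is tested on its own here, matching the preferred embodiment in which
other constituents are not considered.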

Step 7158. Typically, there will be a large number of cellular constituents
expressed in phenotypically extreme organisms that appear to differentiate between the
phenotypic groups identified in step 7152. In some instances, this number of cellular
constituents 48 can exceed the number of organisms 46 available for study. For instance,
in some embodiments, 25,000 genes or more are considered in previous steps. Thus,
there may be hundreds if not thousands of genes that discriminate. In some instances,
these discriminating cellular constituents are analyzed in subsequent steps with statistical
models that involve many statistical parameters that increase with the number of
predictors. In such instances, it is desirable to reduce the number of cellular constituents
using a reducing algorithm. However, in other instances, other forms of statistical
analysis are used that do not require reduction in the number of cellular constituents under
consideration.

The reducing algorithms that are optionally used in step 7158 use the p-value or
other form of metric computed for each cellular constituent in step 7156 as a basis for
reducing the dimensionality of the cellular constituent set identified in step 7156. A few
exemplary reducing algorithms will be discussed. However, those of skill in the art will
appreciate that many reducing algorithms are known in the art and all such algorithms can
be used in step 7158.

One reducing algorithm is stepwise regression. The basic procedure in stepwise
regression involves (1) identifying an initial model (e.g., an initial set of cellular
constituents), (2) iteratively "stepping," that is, repeatedly altering the model at the
previous step by adding or removing a predictor variable (cellular constituent) in
accordance with the "stepping criteria," and (3) terminating the search when stepping is
no longer possible given the stepping criteria, or when a specified maximum number of
steps has been reached. Forward stepwise regression starts with no model terms (i.e., no
cellular constituents). At each step the regression adds the most statistically significant
term until there are none left. Backward stepwise regression starts with all the terms in
the model and removes the least significant cellular constituents until all the remaining
cellular constituents are statistically significant. It is also possible to start with a subset of
all the cellular constituents and then add significant cellular constituents or remove
insignificant cellular constituents until a desired dimensionality reduction is achieved.
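
The forward stepping loop can be sketched in Python as shown below. The sketch is a
skeleton only: it assumes that a caller-supplied score function stands in for the regression
fit and its significance test, and the stepping criterion (a minimum gain in score) is a
hypothetical choice rather than the criterion of any particular statistics package.

```python
def forward_select(candidates, score, min_gain=1e-9, max_steps=None):
    """Greedy forward stepping over candidate cellular constituents.

    score(subset) returns a goodness value for a list of constituents
    (higher is better). Stepping stops when no remaining candidate
    improves the score by more than min_gain, when the candidates are
    exhausted, or when max_steps additions have been made.
    """
    selected = []
    best = score(selected)
    while max_steps is None or len(selected) < max_steps:
        remaining = [c for c in candidates if c not in selected]
        if not remaining:
            break
        gain, choice = max((score(selected + [c]) - best, c) for c in remaining)
        if gain <= min_gain:
            break
        selected.append(choice)
        best = score(selected)
    return selected

# Toy stand-in score: each constituent contributes a fixed amount.
contribution = {"g1": 0.5, "g2": 0.3, "g3": 0.0}
chosen = forward_select(list(contribution),
                        lambda subset: sum(contribution[c] for c in subset))
```

Backward stepping would start from all candidates and remove the least useful one at each
step under the mirror-image criterion.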

Another reducing algorithm that can be used in step 7158 is all-possible-subset
regression. In fact, all-possible-subset regression can be used in conjunction with
stepwise regression. The stepwise regression search approach presumes there is a single
"best" subset of cellular constituents and seeks to identify it. In the all-possible-subset
regression approach, a range of subset sizes that could be considered to be useful is
chosen. Only the "best" of all possible subsets within this range of subset sizes are then
considered. Several different criteria can be used for ordering subsets in terms of
"goodness", such as multiple R-square, adjusted R-square, and Mallows' Cp statistics.
When all-possible-subset regression is used in conjunction with stepwise methods, the
subset multiple R-square statistic allows direct comparisons of the "best" subsets
identified using each approach.


Another approach to reducing higher dimensional space into lower dimensional
space in accordance with step 7158 (Fig. 71A) of the present invention is the use of linear
combinations of cellular constituents. In effect, linear methods project high-dimensional
data onto a lower dimensional space. Two approaches for accomplishing this projection
include Principal Component Analysis (PCA) and Multiple-Discriminant Analysis
(MDA). PCA seeks a projection that best represents the data in a least-squares sense
whereas MDA seeks a projection that best separates the data in a least-squares sense.
See, for example, Duda et al., 2001, Pattern Classification, Chapters 3 and 10.
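
As a non-limiting illustration of the PCA projection, the Python sketch below finds the
first principal component of a small data set by power iteration on the sample covariance
matrix. Power iteration is an assumption made to keep the example free of external
libraries; in practice an eigendecomposition or singular value decomposition routine would
normally be used. The function name and data are hypothetical.

```python
import math

def first_principal_component(rows, iterations=200):
    """Unit vector along the direction of greatest variance.

    rows holds one vector of cellular constituent levels per organism.
    Assumptions: the data are not all identical, and the start vector
    of all ones is not orthogonal to the leading component.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Two perfectly correlated constituents measured in three organisms.
component = first_principal_component([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
```

Projecting each organism's centered vector onto this component replaces the two
constituents with a single linear combination, which is the dimensionality reduction
described above.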

The ultimate goal of step 7158 is to identify a classifier derived from the set of
cellular constituents identified in step 7156 or a subset of the cellular constituents
identified in step 7156 that satisfactorily classifies organisms 46 into the phenotypic
groups 7310 identified in step 7152. In some embodiments of the present invention,
stochastic search methods such as simulated annealing can be used to identify such a
classifier or subset. In the simulated annealing approach, for example, each cellular
constituent under consideration can be assigned a weight in a function that assesses the
aggregate ability of the set of cellular constituents identified in step 7156 to discriminate
the organisms 46 into the phenotypic classes identified in step 7152. During the
simulated annealing algorithm these weights can be adjusted. In fact, some cellular
constituents can be assigned a zero weight and, therefore, be effectively eliminated during
the anneal, thereby effectively reducing the number of cellular constituents used in
subsequent steps. Other stochastic methods that can be used in step 7158 include, but are
not limited to, genetic algorithms. See, for example, the stochastic methods in Chapter 7
of Duda et al., 2001, Pattern Classification, second edition, John Wiley & Sons, New
York.
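
The weight adjustment during an anneal can be sketched in Python as follows. This is a
minimal illustration under stated assumptions: weights are restricted to zero or one, the
temperature falls linearly, and the objective function is a toy stand-in for a function that
assesses how well the weighted constituents discriminate the phenotypic classes. None of
these choices is taken from the text.

```python
import math
import random

def anneal_weights(n, objective, steps=2000, t0=1.0, seed=0):
    """Simulated annealing over 0/1 weights for n cellular constituents.

    objective(weights) returns a value to maximize. A worse trial is
    accepted with probability exp(change / temperature), so acceptance
    of worse moves becomes rare as the temperature falls.
    """
    rng = random.Random(seed)
    weights = [1.0] * n
    current = objective(weights)
    best_weights, best = list(weights), current
    for step in range(steps):
        temperature = t0 * (1.0 - step / steps) + 1e-9
        trial = list(weights)
        i = rng.randrange(n)
        trial[i] = 0.0 if trial[i] else 1.0  # toggle one weight
        value = objective(trial)
        if value >= current or rng.random() < math.exp((value - current) / temperature):
            weights, current = trial, value
            if current > best:
                best_weights, best = list(weights), current
    return best_weights, best

# Toy objective: reward keeping informative constituents, penalize others.
informative = [True, False, True, False]

def toy_objective(weights):
    return sum(w if keep else -w for w, keep in zip(weights, informative))

best_weights, best_score = anneal_weights(4, toy_objective)
```

Constituents left with a zero weight at the end of the anneal are the ones effectively
eliminated from subsequent steps.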

Step 7160. In some embodiments, the cellular constituents identified in steps
7156 and/or 7158 are clustered in order to further identify subgroups within each
phenotypic subpopulation. To perform such clustering, an expression vector is created
for each cellular constituent under consideration. To create an expression vector for a
respective cellular constituent, the level 50 measured for the respective cellular
constituent in each of the phenotypically extreme organisms is used as an element in the
vector. For example, consider the case in which an expression vector for cellular
constituent 48-1 is to be constructed from organisms 46-1, 46-2, and 46-3. Levels 50-1-1,
50-2-1, and 50-3-1 would serve as the three elements of the expression vector that
represents cellular constituent 48-1. The expression vectors are then clustered
using, for example, any of the clustering techniques described in Section 5.5. In one
embodiment, k-means clustering (Section 5.5.2) is used.
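
The construction of expression vectors and their clustering can be illustrated with the
Python sketch below. It assumes a plain k-means procedure with a naive initialization (the
first k vectors), which is a simplification and is not drawn from Section 5.5.2; the level
values are hypothetical. The function math.dist requires Python 3.8 or later.

```python
import math

def kmeans(vectors, k, iterations=50):
    """Plain k-means; returns a cluster label for each vector.

    Assumption: centers are initialized to the first k vectors, which
    keeps the sketch deterministic but is not a recommended practice.
    """
    centers = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: math.dist(v, centers[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Rows are organisms 46-1..46-3; columns are constituents 48-1..48-4.
levels = [[1.0, 1.1, 9.0, 9.2],
          [1.2, 0.9, 8.8, 9.1],
          [0.8, 1.0, 9.1, 9.0]]
# One expression vector per constituent: its level in each organism.
expression_vectors = [list(column) for column in zip(*levels)]
labels = kmeans(expression_vectors, k=2)
```

Transposing the organism-by-constituent table is what turns levels 50-1-1, 50-2-1, and
50-3-1 into the three elements of the vector for constituent 48-1.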

An advantage of step 7160 is that subpopulations 7320 (Fig. 73) that cannot be
differentiated based upon phenotype can be identified. Such subgroups 7320 can be used
to refine a classifier that classifies organisms into classes, as detailed in the following
steps.

Step 7164. In step 7164, the set of cellular constituents identified as
discriminators between phenotypic extremes that were identified in previous steps (or
principal components derived from such cellular constituents) are used to build a
classifier. This set of cellular constituents actually refines the definition of the clinical
phenotype under study.

A number of pattern classification techniques can be used to accomplish this task,
including, but not limited to, Bayesian decision theory, maximum-likelihood estimation,
linear discriminant functions, multilayer neural networks, and supervised as well as
unsupervised learning.

In one embodiment in accordance with step 7164, the set of cellular constituents
that discriminate the phenotypically extreme organisms into phenotypic groups is used to
train a neural network using, for example, a back-propagation algorithm. In this
embodiment, the neural network serves as a classifier. First, the neural network is trained
with the set of cellular constituents that discriminate the phenotypically extreme
organisms into phenotypic groups. In more detail, the cellular constituent values (e.g.,
measured levels 50 of cellular constituents 48 selected in previous steps) from all the
organisms 46 in the phenotypically extreme groups are used to train the neural network.
Then, the trained neural network is used to classify the general population into phenotypic
groups. In some embodiments the neural network that is trained is a multilayer neural
network. In other embodiments, a projection pursuit regression, a generalized additive
model, or a multivariate adaptive regression spline is used. See, for example, any of the
techniques disclosed in Chapter 6 of Duda et al., 2001, Pattern Classification, second
edition, John Wiley & Sons, Inc., New York.

In another embodiment in accordance with step 7164, Bayesian decision theory
can be used to build a classifier using the selected cellular constituent data. Bayesian
decision theory plays a role when there is some a priori information about the things to be
classified. Here, the set of cellular constituents that discriminate the phenotypically
extreme organisms into phenotypic groups serves as the a priori information. More
specifically, the intensity or cellular constituent levels 50 for the cellular constituents 48
selected in steps 7156-7160 from each of the phenotypically extreme organisms 46 serve
as the a priori information. For more information on Bayesian decision theory, see, for
example, any of the techniques disclosed in Chapters 2 and 3 of Duda et al., 2001,
Pattern Classification, second edition, John Wiley & Sons, Inc., New York.

In still another embodiment in accordance with step 7164, linear discriminant
analysis (functions), linear programming algorithms, or support vector machines are used
to create a classifier that is capable of classifying the general population of organisms 46
into phenotypic groups 7310. This classification is based on the cellular constituent data
50 for the cellular constituents 48 that refined the definition of the clinical phenotype
(i.e., the cellular constituents selected in steps 7156, 7158, and/or 7160). For more
information on this class of pattern classification functions, see, for example, any of the
techniques disclosed in Chapter 5 of Duda et al., 2001, Pattern Classification, second
edition, John Wiley & Sons, Inc., New York.
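
A linear discriminant function of the kind referred to above can be illustrated with the
classical perceptron rule, sketched below in Python. The perceptron is chosen here only
because it is short; it is an assumption made for illustration and is not identified in the
text as the preferred technique. The training data, labels, and function names are
hypothetical.

```python
def train_perceptron(samples, labels, epochs=1000, rate=0.1):
    """Fit weights w and bias b of a linear discriminant w.x + b.

    labels are +1 or -1. The loop stops early once every training
    sample is on the correct side, which is guaranteed to happen
    eventually only if the two groups are linearly separable.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified or on the boundary
                w = [wi + rate * y * xi for wi, xi in zip(w, x)]
                b += rate * y
                errors += 1
        if errors == 0:
            break
    return w, b

def classify(w, b, x):
    """Return +1 or -1 according to the side of the discriminant."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Hypothetical levels 50 of two constituents 48 in four extreme organisms.
extremes = [[1.0, 1.0], [1.5, 0.5], [8.0, 9.0], [9.0, 8.5]]
groups = [-1, -1, 1, 1]
w, b = train_perceptron(extremes, groups)
```

Once trained on the phenotypically extreme organisms, the same classify function can be
applied to the levels measured in any other organism 46.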

Step 7166. In step 7166, the classifier derived in step 7164 is used to classify all
or a substantial portion (e.g., more than 30%, more than 50%, more than 75%) of the
population under study. Essentially, the classifier bins the remaining population (the
portions of the population that do not include the phenotypic extremes) without taking
their phenotypic data (e.g., phenotype amounts 7201, Fig. 72) into consideration. The process
of using the classifier to classify the general population produces phenotypic subgroups
7350 (Fig. 73). Phenotypic subgroups 7350 are, in fact, a refinement of the trait under
study.
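
The binning of the remaining population can be sketched in Python as shown below. For
brevity the sketch assumes a nearest-centroid rule as the classifier, which is a stand-in for
whichever classifier is built in step 7164; the group labels and levels are hypothetical. The
function math.dist requires Python 3.8 or later.

```python
import math

def centroids(training):
    """Mean level vector of each phenotypic group.

    training maps a group label to the level vectors measured in the
    phenotypically extreme organisms assigned to that group.
    """
    return {group: [sum(col) / len(vectors) for col in zip(*vectors)]
            for group, vectors in training.items()}

def assign(vector, group_centroids):
    """Label of the group whose centroid is nearest to the vector."""
    return min(group_centroids,
               key=lambda group: math.dist(vector, group_centroids[group]))

# Train on the extremes, then bin organisms that were not extremes.
training = {"high": [[9.0, 8.0], [8.0, 9.0]], "low": [[1.0, 2.0], [2.0, 1.0]]}
remaining = {"46-2": [7.0, 7.5], "46-5": [3.0, 2.0]}
cents = centroids(training)
subgroups = {organism: assign(levels, cents)
             for organism, levels in remaining.items()}
```

Only cellular constituent levels enter the assignment; the phenotype amounts 7201 of the
remaining organisms are not consulted.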

Step 7168. The steps leading to and including step 7160 serve to identify cellular
constituents that are capable of classifying organisms into phenotypic groups. In step
7164, this set of cellular constituents is used to construct a classifier that is capable of
classifying the general population under study into phenotypic groups 7310. In many
pattern classification techniques, such as a back-propagation algorithm that uses a
multilayer network, the classifier constructed in step 7164 will no longer be the simple
subset of cellular constituents identified in steps 7156 through 7160. Rather, the form of
the classifier will depend on the type of pattern recognition technique used to develop the
classifier. In some embodiments, however, the classifier derived in step 7164 can be a set
of cellular constituents in the case where the classification scheme is a simple decision
tree (e.g., if the level for constituent 5 is greater than 50, then place in phenotypic class B).
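
The simple decision tree given as an example above can be written out directly. This is
a minimal sketch: the dictionary layout is an assumption, and because the text names
only class B, the fallback label "other" is hypothetical.

```python
def simple_tree(levels):
    """Classify one organism with a single-split decision tree.

    levels maps a cellular constituent number to its measured level
    (hypothetical layout).
    """
    if levels[5] > 50:
        return "B"
    # Assumption: the source does not say where the remaining organisms go.
    return "other"


label_above = simple_tree({5: 62.0})
label_at_threshold = simple_tree({5: 50.0})
```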

Regardless of its form, the classifier formed in step 7164 serves to further refine
the phenotypic groups 7310 defined in step 7152 or the subgroups 7320 defined in step
7160. As such, the methods disclosed in this section can be used to refine a trait under
study. This refinement is illustrated in Fig. 73. At the outset, the trait under study is
exhibited by some population 7300 of organisms 46. In step 7152 of the method,
observation of gross (visible, measurable) phenotypes (other than cellular constituent
levels) related to the trait is used to divide the general population 7300 into two or more
phenotypic groups 7310 (Fig. 73). In step 7160 of the method, optional clustering of
select cellular constituents serves to refine a phenotypic group into subphenotypic groups
7320 (Fig. 73).

A benefit of step 7160 is that the clustering in step 7160 refines the trait under
study into groups 7320 (Fig. 73) that are not distinguishable using gross observable
phenotypic data (other than cellular constituent levels) such as amounts 7201 (Fig. 72).
As such, optional step 7160 provides a powerful way to refine the definition of the
clinical trait under study by focusing on those cellular constituents that actually give rise
to the clinical trait or that well reflect the varied biochemical response to that trait.
However, the refinement provided in step 7160 is incomplete because it is based on only
a select portion of the general population under study, those organisms that represent
phenotypic extremes. Accordingly, in step 7164, a more robust classifier is built using the
initial set of cellular constituents selected based upon phenotypically extreme organisms
46 as a starting point. As illustrated in Fig. 73, in step 7166, the classifier derived in step
7164 classifies the trait under study into highly refined subgroups 7350. Thus, although
only gross categories such as groups 7310 or 7320 were used to develop the classifier, the
classifier will split the population into clusters that can fall within groups 7310 and/or
7320. These clusters are denoted as subgroups 7350 in Fig. 73. Each of these subgroups
7350 serves to refine the trait under study. In other words, each of the subgroups 7350 is
a more homogeneous form of the overall trait under study. The classifier classifies the
general population without considering phenotypic data (e.g., levels 7201, Fig. 72).
Therefore, it is possible that the groups 7350 will not fall neatly within groups 7320
and/or 7310.




The classifier developed using the methods described in this section serves to
refine the definition of a trait of interest. Thus, each group 7350 in Fig. 73 identified
using the classifier represents a more homogeneous population with respect to the trait of
interest. Cellular constituent measurements from organisms in respective groups 7350
can be used as quantitative traits in quantitative genetic studies such as linkage analysis
(Section 5.13) or association analysis (Section 5.14). It is expected that linkage analysis
and/or association analysis using data from individual groups 7350 rather than the general
population will provide improved results, particularly in situations where the trait under
study is complex and/or is driven by many different genes. In such instances, the
individual groups 7350 could represent a more homogeneous population or state.
Consequently, the genes that drive or link to the QTL (or loci) patterns in such populations
7350 could be easier to identify than in the case where cellular constituent data from the
entire population are used as quantitative traits in such studies. An example where
quantitative genetic analysis on subgroups rather than the general population was used to
identify genes associated with a trait of interest is provided in Schadt et al., 2003, Nature
422, p. 297.

5.18. OBESITY RELATED GENES AND OBESITY RELATED GENE PRODUCTS

In Section 6.7.1, four genes were discovered on mouse chromosome number 2 by
co-localizing cQTL for the mouse obesity related traits (1) subcutaneous fat pad mass
(Fig. 20, curve 2002), (2) perimetrial fat pad mass (Fig. 20, curve 2004), (3) omental fat
pad mass (Fig. 20, curve 2006), and (4) adiposity (Fig. 20, curve 2008) with four eQTL
with lod scores greater than 3.0 that correspond to genes whose physical locations are
within the vicinity (e.g., 2 cM) of the four cQTL.
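
By way of illustration and not limitation, the co-localization filter described above
(lod score greater than 3.0, position within about 2 cM of a cQTL) can be sketched as
follows. The record layout, names, positions, and scores are hypothetical, and the eQTL
position is treated as standing in for the physical location of the corresponding gene,
which is a simplification of the analysis in Section 6.7.1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Qtl:
    name: str
    chromosome: str
    position_cm: float
    lod: float


def colocalized_eqtl(cqtls, eqtls, min_lod=3.0, window_cm=2.0):
    """Return (eQTL name, cQTL name) pairs that co-localize.

    An eQTL qualifies when its lod score exceeds min_lod and it lies on the
    same chromosome within window_cm of a cQTL.
    """
    strong = [e for e in eqtls if e.lod > min_lod]
    hits = []
    for e in strong:
        for c in cqtls:
            same_chromosome = e.chromosome == c.chromosome
            near = window_cm >= abs(e.position_cm - c.position_cm)
            if same_chromosome and near:
                hits.append((e.name, c.name))
    return hits


# Hypothetical records, not values from Fig. 20.
cqtls = [Qtl("fat_pad_mass", "2", 80.0, 5.1)]
eqtls = [
    Qtl("gene_a", "2", 81.2, 4.0),  # strong and nearby: kept
    Qtl("gene_b", "2", 79.5, 2.5),  # nearby but lod too low
    Qtl("gene_c", "2", 90.0, 6.0),  # strong but 10 cM away
    Qtl("gene_d", "7", 80.0, 7.0),  # strong but wrong chromosome
]
pairs = colocalized_eqtl(cqtls, eqtls)
```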

Using the methods of the present invention as described, for example, in Section
5.17 above, the human orthologs to these four mouse genes were determined. The four
mouse genes, their gene products, and their human orthologs are summarized in Tables 3
and 4 in Section 6.7.5 below (SEQ ID NO: 1 through SEQ ID NO: 29). Together, these
genes and proteins are referred to as "obesity related genes" and "obesity related gene
products" of the present invention.

The term "obesity related genes" includes cDNAs or other nucleic acids that
encode any one of SEQ ID NO: 4, SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, SEQ
ID NO: 8, SEQ ID NO: 10, SEQ ID NO: 11, SEQ ID NO: 14, SEQ ID NO: 15, SEQ ID
NO: 17, SEQ ID NO: 18, SEQ ID NO: 21, SEQ ID NO: 22, SEQ ID NO: 23, SEQ ID
NO: 24, SEQ ID NO: 25, SEQ ID NO: 26, SEQ ID NO: 27, SEQ ID NO: 28, and SEQ ID
NO: 29 in whole or in part. As such, the term "obesity related genes" includes the nucleic
acid sequences set forth in SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 9,
SEQ ID NO: 12, SEQ ID NO: 13, SEQ ID NO: 16, SEQ ID NO: 19, and SEQ ID NO: 20.
The "obesity related genes" of the invention include human and mouse genes as well as
related genes (homologs or orthologs) in other species.

The term "obesity related gene products" includes amino acid macromolecules
that include a sequence as substantially set forth in any one of SEQ ID NO: 4, SEQ ID
NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, SEQ ID NO: 8, SEQ ID NO: 10, SEQ ID NO: 11,
SEQ ID NO: 14, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 18, SEQ ID NO: 21,
SEQ ID NO: 22, SEQ ID NO: 23, SEQ ID NO: 24, SEQ ID NO: 25, SEQ ID NO: 26,
SEQ ID NO: 27, SEQ ID NO: 28, and SEQ ID NO: 29 as described in Section 6.7.5.

In specific embodiments, the "obesity related genes" and "obesity related gene
products" are from vertebrates, or more particularly, mammals. Production of the
foregoing proteins and derivatives, e.g., by recombinant methods, is provided.

The invention also relates to obesity related gene products that are functionally
active, i.e., they are capable of displaying one or more known functional activities
associated with a full-length (wild-type) obesity related gene product.

The invention further relates to fragments, and derivatives thereof, of obesity
related gene products that comprise one or more domains of an obesity related gene
product. Antibodies to obesity related gene products and derivatives of such antibodies
(e.g., the binding domain of such antibodies) are further provided by the present
invention.

The present invention further relates to therapeutic and diagnostic methods and 
compositions based on obesity related genes and/or obesity related gene products as well 
as antibodies that bind to the obesity related gene products. 

Animal models, diagnostic methods and screening methods for predisposition to
disorders are also provided by the invention.

The invention further provides methods of treatment of obesity and obesity related
diseases such as anorexia nervosa, bulimia nervosa, and cachexia using modulators of the
obesity related genes and obesity related gene products described in Section 6.7.5.
Modulators, e.g., inhibitors and agonists, of the obesity related genes and obesity related
gene products described in Section 6.7.5 can be identified by any method known in the
art. In particular, molecules can be assayed for their ability to promote or inhibit
(modulate) the expression of the obesity related genes described in Section 6.7.5. Once
modulators are identified, they can be assayed for therapeutic efficacy using any assay
available in the art for obesity.

Modulators may be identified by screening for molecules that bind to the obesity
related gene products described in Section 6.7.5. Molecules that bind such gene products
may be identified in many ways that are well known and routine in the art. For example,
but not by way of limitation, such molecules may be identified by overexpressing such a
gene product (e.g., SEQ ID NO: 8, SEQ ID NO: 11, SEQ ID NO: 15, SEQ ID NO: 18,
SEQ ID NO: 25, SEQ ID NO: 26, or SEQ ID NO: 27) in a cell line that endogenously
expresses little or none of the gene product and assaying for molecules that bind to the
cells overexpressing the gene product and that do not bind to the cells not overexpressing
the gene product; or by conjugating the gene product to a solid support (e.g., a
chromatography resin), contacting the conjugated gene product with a molecule of
interest, isolating the solid support, and determining whether the molecule of interest
bound to the gene product. Other methods include screening phage display libraries,
combinatorial chemical libraries, and the like for binding to one or more of the gene
products described in Section 6.7.5.

5.18.1. SCREENING FOR OBESITY RELATED GENE AGONISTS AND ANTAGONISTS

The amino acid and nucleotide sequences for the obesity related genes and obesity
related gene products of the present invention are illustrated in Figs. 27-32, 34-38, and
40-53. Thus, these nucleotide and amino acid sequences can be used to prepare protein
for screening by methods that are routine and well known in the art (see Sambrook
et al., 2001, Molecular Cloning, A Laboratory Manual, Third Edition, Cold Spring
Harbor Laboratory Press, N.Y.; and Ausubel et al., 1989, Current Protocols in Molecular
Biology, Greene Publishing Associates and Wiley Interscience, N.Y., both of which are
hereby incorporated by reference in their entireties). For example, using any of the gene
sequences disclosed in Section 6.7.5, oligonucleotide primers for PCR amplification can
be designed. PCR amplification is then used to amplify specifically the obesity related
protein coding sequence, which can be cloned into an appropriate expression vector using
routine techniques. That vector may then be introduced into bacterial or cultured
eukaryotic cells (e.g., cultured mammalian cells, insect cells, etc.) such that the obesity
related gene product is expressed in the bacterial or cultured cell. The obesity related
gene product may then be isolated from the bacterial or eukaryotic cell culture.
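
By way of illustration and not limitation, the primer design step mentioned above can be
sketched as follows: the forward primer copies the 5' end of the coding sequence and the
reverse primer is the reverse complement of its 3' end. The toy sequence and the
six-base primer length are hypothetical, and considerations that matter in practice
(melting temperature, GC content, added restriction sites) are deliberately left out.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def flanking_primers(coding_seq, length=20):
    """Return (forward, reverse) primers that bracket coding_seq.

    Both primers are written 5' to 3'. The reverse primer anneals to the
    coding strand, so it is the reverse complement of the last bases.
    """
    if 2 * length > len(coding_seq):
        raise ValueError("sequence too short for two non-overlapping primers")
    forward = coding_seq[:length]
    reverse = reverse_complement(coding_seq[-length:])
    return forward, reverse


# Hypothetical 21-base open reading frame, not a sequence from Section 6.7.5.
toy_orf = "ATGGCCAAGTTTGGGCCCTAA"
forward, reverse = flanking_primers(toy_orf, length=6)
```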

By way of example, diversity libraries, such as random or combinatorial peptide
or nonpeptide libraries, can be screened for molecules that specifically bind to and/or
modulate the function of the obesity related gene product. Many libraries are known in
the art that can be used, e.g., chemically synthesized libraries, recombinant libraries
(e.g., phage display libraries), and in vitro translation-based libraries.

Examples of chemically synthesized libraries are described in Fodor et al., 1991,
Science 251:767-773; Houghten et al., 1991, Nature 354:84-86; Lam et al., 1991,
Nature 354:82-84; Medynski, 1994, Bio/Technology 12:709-710; Gallop et al., 1994, J.
Medicinal Chemistry 37(9):1233-1251; Ohlmeyer et al., 1993, Proc. Natl. Acad. Sci.
USA 90:10922-10926; Erb et al., 1994, Proc. Natl. Acad. Sci. USA 91:11422-11426;
Houghten et al., 1992, Biotechniques 13:412; Jayawickreme et al., 1994, Proc. Natl.
Acad. Sci. USA 91:1614-1618; Salmon et al., 1993, Proc. Natl. Acad. Sci. USA
90:11708-11712; PCT Publication No. WO 93/20242; and Brenner and Lerner, 1992,
Proc. Natl. Acad. Sci. USA 89:5381-5383.

Examples of phage display libraries are described in Scott and Smith, 1990,
Science 249:386-390; Devlin et al., 1990, Science 249:404-406; Christian, R.B., et al.,
1992, J. Mol. Biol. 227:711-718; Lenstra, 1992, J. Immunol. Meth. 152:149-157; Kay et
al., 1993, Gene 128:59-65; and PCT Publication No. WO 94/18318 dated August 18,
1994.

In vitro translation-based libraries include but are not limited to those described in
PCT Publication No. WO 91/05058 dated April 18, 1991; and Mattheakis et al., 1994,
Proc. Natl. Acad. Sci. USA 91:9022-9026.

By way of examples of nonpeptide libraries, a benzodiazepine library (see e.g.,
Bunin et al., 1994, Proc. Natl. Acad. Sci. USA 91:4708-4712) can be adapted for use.
Peptoid libraries (Simon et al., 1992, Proc. Natl. Acad. Sci. USA 89:9367-9371) can also
be used. Another example of a library that can be used, in which the amide
functionalities in peptides have been permethylated to generate a chemically transformed
combinatorial library, is described by Ostresh et al. (1994, Proc. Natl. Acad. Sci. USA
91:11138-11142).

Screening the libraries can be accomplished by any of a variety of commonly
known methods. See, e.g., the following references, which disclose screening of peptide
libraries: Parmley and Smith, 1989, Adv. Exp. Med. Biol. 251:215-218; Scott and Smith,
1990, Science 249:386-390; Fowlkes et al., 1992, BioTechniques 13:422-427; Oldenburg
et al., 1992, Proc. Natl. Acad. Sci. USA 89:5393-5397; Yu et al., 1994, Cell 76:933-945;
Staudt et al., 1988, Science 241:577-580; Bock et al., 1992, Nature 355:564-566; Tuerk
et al., 1992, Proc. Natl. Acad. Sci. USA 89:6988-6992; Ellington et al., 1992, Nature
355:850-852; U.S. Patent No. 5,096,815, U.S. Patent No. 5,223,409, and U.S. Patent No.
5,198,346, all to Ladner et al.; Rebar and Pabo, 1993, Science 263:671-673; and PCT
Publication No. WO 94/18318.

In a specific embodiment, screening can be carried out by contacting the library
members with an obesity related gene product disclosed in Section 6.7.5 (or nucleic acid
or derivative) immobilized on a solid phase and harvesting those library members that
bind to the protein (or nucleic acid or derivative). Examples of such screening methods,
termed "panning" techniques, are described by way of example in Parmley and Smith,
1988, Gene 73:305-318; Fowlkes et al., 1992, BioTechniques 13:422-427; PCT
Publication No. WO 94/18318; and in references cited hereinabove.

In another embodiment, the two-hybrid system for selecting interacting proteins in
yeast (Fields and Song, 1989, Nature 340:245-246; Chien et al., 1991, Proc. Natl. Acad.
Sci. USA 88:9578-9582) can be used to identify molecules that specifically bind to an
obesity related gene product disclosed in Section 6.7.5 or derivative.

5.18.2. ISOLATION OF OBESITY RELATED GENES

The invention relates to the nucleotide sequences of nucleic acids. In a specific
embodiment, the invention relates to nucleic acids that encode an amino acid sequence
substantially as set forth in SEQ ID NO: 8 (Fig. 31), such as, for example, SEQ ID NO: 2
and SEQ ID NO: 3 (Figs. 28, 29). In further specific embodiments, nucleic acids of the
present invention comprise the cDNA sequences of SEQ ID NO: 2 or SEQ ID NO: 3, the
coding regions thereof, or the complements thereto.




The invention provides purified nucleic acids consisting of at least 10 nucleotides
(i.e., a hybridizable portion) of a nucleotide sequence encoding a SEQ ID NO: 2 or SEQ
ID NO: 3; in other embodiments, the nucleic acids consist of at least 10, 20, 50, 100, 150,
or 200 contiguous nucleotides of a nucleotide sequence encoding SEQ ID NO: 2 or SEQ
ID NO: 3, or a full-length coding sequence. In another embodiment, the nucleic acids are
smaller than 35, 200 or 500 nucleotides in length. Nucleic acids can be single or double
stranded. In another embodiment, the nucleic acids comprise a sequence of at least 10
nucleotides that encode a fragment of SEQ ID NO: 8, wherein the fragment of SEQ
ID NO: 8 displays one or more functional activities of SEQ ID NO: 8.

5.18.2.1. LOW STRINGENCY CONDITIONS

The invention also relates to nucleic acids hybridizable to or complementary to the
foregoing sequences. In specific aspects, nucleic acids are provided that comprise a
sequence complementary to at least 10, 25, 50, 100, or 200 nucleotides or the entire
coding region of a gene encoding SEQ ID NO: 8. The invention further relates to nucleic
acid sequences that bind under conditions of low stringency to a nucleic acid that encodes
SEQ ID NO: 8.

By way of example and not limitation, procedures using such conditions of low
stringency are as follows (see also Shilo and Weinberg, 1981, Proc. Natl. Acad. Sci.
U.S.A. 78:6789-6792): Filters containing DNA are pretreated for 6 h at 40°C in a
solution containing 35% formamide, 5X SSC, 50 mM Tris-HCl (pH 7.5), 5 mM EDTA,
0.1% PVP, 0.1% Ficoll, 1% BSA, and 500 µg/ml denatured salmon sperm DNA.
Hybridizations are carried out in the same solution with the following modifications:
0.02% PVP, 0.02% Ficoll, 0.2% BSA, 100 µg/ml salmon sperm DNA, 10% (wt/vol)
dextran sulfate, and 5-20 X 10^6 cpm 32P-labeled probe is used. Filters are incubated in
hybridization mixture for 18-20 h at 40°C, and then washed for 1.5 h at 55°C in a solution
containing 2X SSC, 25 mM Tris-HCl (pH 7.4), 5 mM EDTA, and 0.1% SDS. The wash
solution is replaced with fresh solution and incubated an additional 1.5 h at 60°C. Filters
are blotted dry and exposed for autoradiography. If necessary, filters are washed for a
third time at 65-68°C and reexposed to film. Other conditions of low stringency that may
be used are well known in the art (e.g., as employed for cross-species hybridizations).




5.18.2.2. HIGH STRINGENCY CONDITIONS

In another specific embodiment, a nucleic acid hybridizable to a nucleic acid
encoding SEQ ID NO: 8 under conditions of high stringency is provided. By way of
example and not limitation, procedures using such conditions of high stringency are as
follows. Prehybridization of filters containing DNA is carried out for 8 hours to
overnight at 65°C in buffer composed of 6X SSC, 50 mM Tris-HCl (pH 7.5), 1 mM
EDTA, 0.02% PVP, 0.02% Ficoll, 0.02% BSA, and 500 µg/ml denatured salmon sperm
DNA. Filters are hybridized for 48 hours at 65°C in prehybridization mixture containing
100 µg/ml denatured salmon sperm DNA and 5-20 X 10^6 cpm of 32P-labeled probe.
Washing of filters is done at 37°C for one hour in a solution containing 2X SSC, 0.01%
PVP, 0.01% Ficoll, and 0.01% BSA. This is followed by a wash in 0.1X SSC at 50°C for
45 minutes before autoradiography. Other conditions of high stringency that may be used
are well known in the art.



5.18.2.3. MODERATE STRINGENCY CONDITIONS

In another specific embodiment, a nucleic acid that is hybridizable to a nucleic
acid encoding SEQ ID NO: 8 under conditions of moderate stringency is provided. As
used herein, conditions of moderate stringency, as known to those having ordinary skill in
the art, and as defined by Sambrook et al. (Molecular Cloning: A Laboratory Manual, 2nd
Ed., Vol. 1, pp. 1.101-104, Cold Spring Harbor Laboratory Press, 1989), include use of a
prewashing solution for the nitrocellulose filters of 5X SSC, 0.5% SDS, 1.0 mM EDTA
(pH 8.0), hybridization conditions of 50% formamide, 6X SSC at 42°C (or other similar
hybridization solution, or Stark's solution, in 50% formamide at 42°C), and washing
conditions of about 60°C, 0.5X SSC, 0.1% SDS. See also Ausubel et al., eds., in the
Current Protocols in Molecular Biology series of laboratory technique manuals, © 1987-
1997, Current Protocols, © 1994-1997, John Wiley and Sons, Inc. The skilled artisan
will recognize that the temperature, salt concentration, and chaotrope composition of
hybridization and wash solutions may be adjusted as necessary according to factors such
as the length and nucleotide base composition of the probe.
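
To illustrate how probe length, base composition, salt, and formamide interact, the
following sketch uses a widely quoted empirical approximation for the melting
temperature of long DNA:DNA hybrids. The formula and its coefficients are not taken
from this application and are offered only as a rough guide under stated assumptions;
the probe sequence is hypothetical. The conversion from SSC strength to sodium
concentration assumes the usual 1X SSC recipe of 0.15 M NaCl and 0.015 M trisodium
citrate.

```python
from math import log10


def gc_percent(probe):
    """Percentage of G and C bases in the probe."""
    probe = probe.upper()
    return 100.0 * (probe.count("G") + probe.count("C")) / len(probe)


def sodium_molar_from_ssc(strength):
    """Approximate Na+ molarity of an SSC solution of the given strength.

    Assumes 1X SSC = 0.15 M NaCl + 0.015 M trisodium citrate.
    """
    return strength * (0.15 + 3 * 0.015)


def estimated_tm_celsius(probe, sodium_molar, formamide_percent=0.0):
    """Empirical Tm estimate for long DNA:DNA hybrids (an approximation).

    Tm = 81.5 + 16.6*log10[Na+] + 0.41*(%GC) - 0.61*(%formamide) - 500/N
    """
    n = len(probe)
    return (81.5
            + 16.6 * log10(sodium_molar)
            + 0.41 * gc_percent(probe)
            - 0.61 * formamide_percent
            - 500.0 / n)


# Hypothetical 200-base probe with 50% GC content.
probe = "GC" * 50 + "AT" * 50
tm_no_formamide = estimated_tm_celsius(probe, sodium_molar=1.0)
tm_half_formamide = estimated_tm_celsius(probe, 1.0, formamide_percent=50.0)
```

Under this approximation, raising the formamide concentration or lowering the salt
concentration lowers the estimated melting temperature, which is why wash stringency
can be tuned with either variable.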



5.18.2.4. DERIVATIVES AND ANTISENSE NUCLEIC ACIDS

Nucleic acids encoding derivatives of SEQ ID NO: 8 and antisense nucleic acids
to genes encoding SEQ ID NO: 8 are additionally provided. As is readily apparent, as
used herein, a nucleic acid encoding a fragment or portion of SEQ ID NO: 8 shall be
construed as referring to a nucleic acid encoding only the recited fragment or portion of
the specific SEQ ID NO: 8 and not the other contiguous portions of SEQ ID NO: 8 as a
continuous sequence.



5.18.2.5. MOLECULAR BIOLOGY 

For expression cloning (a technique commonly known in the art), an expression
library is constructed by methods known in the art. For example, mRNA (e.g., human) is
isolated, cDNA is made and ligated into an expression vector (e.g., a bacteriophage
derivative) such that it is capable of being expressed by the host cell into which it is then
introduced. Various screening assays can then be used to select for the expressed protein
product. In one embodiment, anti-SEQ ID NO: 8 antibodies can be used for selection.

In another embodiment of the invention, polymerase chain reaction (PCR) is used
to amplify the desired sequence in a genomic or cDNA library, prior to selection.
Oligonucleotide primers representing known SEQ ID NO: 8-encoding sequences can be
used as primers in PCR. In a preferred aspect, the oligonucleotide primers represent at
least part of the conserved segments of strong homology between SEQ ID NO:
8-encoding genes of different species. The synthetic oligonucleotides may be utilized as
primers to amplify by PCR sequences from RNA or DNA, preferably a cDNA library, of
potential interest. Alternatively, one can synthesize degenerate primers for use in the
PCR reactions.
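
By way of illustration and not limitation, a degenerate primer can be viewed as
shorthand for a pool of concrete sequences. The sketch below expands a primer written
with the standard IUPAC nucleotide ambiguity codes into every sequence it represents.
The example primer is hypothetical and is not derived from any sequence in this
application.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate(primer):
    """List every concrete sequence encoded by a degenerate primer."""
    pools = [IUPAC[base] for base in primer.upper()]
    return ["".join(bases) for bases in product(*pools)]


def degeneracy(primer):
    """Number of distinct sequences in the primer pool."""
    count = 1
    for base in primer.upper():
        count *= len(IUPAC[base])
    return count


# Hypothetical primer covering the codons ATG, GAY, and AAR.
pool = expand_degenerate("ATGGAYAAR")
pool_size = degeneracy("ATGGAYAAR")
```

The pool size grows multiplicatively with each ambiguous position, which is one reason
conserved segments with low codon degeneracy are favored when such primers are chosen.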

In PCR according to the invention, the nucleic acid being amplified can include
RNA or DNA, for example, mRNA, cDNA or genomic DNA from any eukaryotic
species. PCR can be carried out, e.g., by use of a Perkin-Elmer Cetus thermal cycler and
Taq polymerase. It is also possible to vary the stringency of hybridization conditions
used in priming the PCR reactions, to allow for greater or lesser degrees of nucleotide
sequence similarity between a known TGAP nucleotide sequence and a nucleic acid
homolog being isolated. For cross-species hybridization, low stringency conditions are
preferred. For same-species hybridization, moderately stringent conditions are preferred.
After successful amplification of a segment of a SEQ ID NO: 8 gene homolog, that
segment may be cloned, sequenced, and utilized as a probe to isolate a complete cDNA or
genomic clone. This, in turn, will permit the determination of the gene's complete
nucleotide sequence, the analysis of its expression, and the production of its protein
product for functional analysis, as described infra. In this fashion, additional genes
encoding SEQ ID NO: 8 may be identified.

The above recited methods are not meant to limit the following general
description of methods by which clones of genes encoding SEQ ID NO: 8 may be
obtained.

Any eukaryotic cell potentially can serve as the nucleic acid source for the
molecular cloning of a SEQ ID NO: 8-encoding gene. The nucleic acid sequences
encoding a SEQ ID NO: 8 homolog or ortholog can be isolated from vertebrate,
mammalian, human, porcine, bovine, feline, avian, equine, canine, as well as additional
primate sources. The DNA may be obtained by standard procedures known in the art
from cloned DNA (e.g., a DNA "library"), by chemical synthesis, by cDNA cloning, or
by the cloning of genomic DNA, or fragments thereof, purified from the desired cell.
(See, for example, Sambrook et al., 1989, Molecular Cloning, A Laboratory Manual, 2nd
ed., Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York; Glover, D.M.
(ed.), 1985, DNA Cloning: A Practical Approach, MRL Press, Ltd., Oxford, U.K., Vol. I,
II). Clones derived from genomic DNA may contain regulatory and intron DNA regions
in addition to coding regions; clones derived from cDNA will contain only exon
sequences. Whatever the source, the gene should be cloned into a suitable vector for
propagation of the gene.

In the cloning of the gene from genomic DNA, DNA fragments are generated,
some of which will encode the desired gene. The DNA may be cleaved at specific sites
using various restriction enzymes. Alternatively, one may use DNase in the presence of
manganese to fragment the DNA, or the DNA can be physically sheared, as for example,
by sonication. The linear DNA fragments can then be separated according to size by
standard techniques, including but not limited to, agarose and polyacrylamide gel
electrophoresis and column chromatography.
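
By way of illustration and not limitation, cleavage at specific sites followed by size
separation can be simulated as below. The sketch models only the top strand of a linear
molecule and cuts at a fixed offset inside each recognition site; EcoRI (recognition
sequence GAATTC, cut between G and A) is used as the example enzyme. The input
sequence is hypothetical.

```python
def digest(dna, site, cut_offset):
    """Cut linear dna at every occurrence of site.

    cut_offset is the number of bases of the recognition site that stay on
    the upstream fragment (1 for EcoRI, which cuts G^AATTC).
    """
    fragments = []
    start = 0
    pos = dna.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(dna[start:cut])
        start = cut
        pos = dna.find(site, pos + 1)
    fragments.append(dna[start:])
    # Drop empty pieces that arise when a cut falls at either end.
    return [f for f in fragments if f]


def sizes_by_length(fragments):
    """Fragment lengths sorted largest first, as read from the top of a gel."""
    return sorted((len(f) for f in fragments), reverse=True)


# Hypothetical 21-base molecule with two EcoRI sites.
dna = "AAAGAATTCCCCCGAATTCTT"
fragments = digest(dna, "GAATTC", cut_offset=1)
band_sizes = sizes_by_length(fragments)
```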

Once the DNA fragments are generated, identification of the specific DNA
fragment containing the desired gene may be accomplished in a number of ways. For
example, if a gene encoding SEQ ID NO: 8 (of any species) or its specific RNA, or a
fragment thereof, is available and can be purified and labeled, the generated DNA
fragments may be screened by nucleic acid hybridization to the labeled probe (Benton
and Davis, 1977, Science 196:180; Grunstein & Hogness, 1975, Proc. Natl. Acad. Sci.
U.S.A. 72:3961). Those DNA fragments with substantial homology to the probe will
hybridize. It is also possible to identify the appropriate fragment by restriction enzyme
digestion(s) and comparison of fragment sizes with those expected according to a known
restriction map if such is available. Further selection can be carried out on the basis of
the properties of the gene.

Alternatively, the presence of the gene may be detected by assays based on the
physical, chemical, or immunological properties of its expressed product. For example,
cDNA clones, or DNA clones that hybrid-select the proper mRNAs, can be selected that
produce a protein having, e.g., similar or identical electrophoretic migration, isoelectric
focusing behavior, proteolytic digestion maps, kinase activity, inhibition of cell
proliferation activity, substrate binding activity, or antigenic properties as known for SEQ
ID NO: 8. If an antibody to SEQ ID NO: 8 is available, SEQ ID NO: 8 may be identified
by binding of labeled antibody to the clone(s) putatively producing SEQ ID NO: 8 in an
ELISA (enzyme-linked immunosorbent assay)-type procedure.

Alternatives to isolating the genomic DNA encoding SEQ ID NO: 8 include, but
are not limited to, chemically synthesizing the gene sequence itself from a known
sequence or making cDNA to the mRNA which encodes SEQ ID NO: 8. Other methods
are possible and within the scope of the invention.

The identified and isolated gene can then be inserted into an appropriate cloning
vector. A large number of vector-host systems known in the art may be used. Possible
vectors include, but are not limited to, plasmids or modified viruses, but the vector system
must be compatible with the host cell used. Such vectors include, but are not limited to,
bacteriophages such as lambda derivatives, or plasmids such as pBR322 or pUC plasmid
derivatives or the Bluescript vector (Stratagene). The insertion into a cloning vector can,
for example, be accomplished by ligating the DNA fragment into a cloning vector which
has complementary cohesive termini. However, if the complementary restriction sites
used to fragment the DNA are not present in the cloning vector, the ends of the DNA
molecules may be enzymatically modified. Alternatively, any site desired may be
produced by ligating nucleotide sequences (linkers) onto the DNA termini; these ligated
linkers may comprise specific chemically synthesized oligonucleotides encoding
restriction endonuclease recognition sequences. In an alternative method, the cleaved
vector and SEQ ID NO: 8-encoding gene may be modified by homopolymeric tailing.
Recombinant molecules can be introduced into host cells via transformation, transfection,
infection, electroporation, etc., so that many copies of the gene sequence are generated.


In an alternative method, the desired gene may be identified and isolated after
insertion into a suitable cloning vector in a "shotgun" approach. Enrichment for the
desired gene, for example, by size fractionation, can be done before insertion into the
cloning vector.

In specific embodiments, transformation of host cells with recombinant DNA
molecules that incorporate the isolated SEQ ID NO: 8-encoding gene, cDNA, or
synthesized DNA sequence enables generation of multiple copies of the gene. Thus, the
gene may be obtained in large quantities by growing transformants, isolating the
recombinant DNA molecules from the transformants and, when necessary, retrieving the
inserted gene from the isolated recombinant DNA.

The nucleotide sequences encoding SEQ ID NO: 8 that are provided by the instant
invention include those nucleotide sequences encoding substantially the same amino acid
sequences as found in native SEQ ID NO: 8 proteins, and those encoded amino acid
sequences with functionally equivalent amino acids. Sequences suitable for hybridization
to SEQ ID NO: 12, SEQ ID NO: 16, and SEQ ID NO: 20 may be obtained in a similar
fashion.
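
By way of illustration and not limitation, the point that different nucleotide sequences
can encode the same amino acid sequence can be checked with the standard genetic code.
The sketch below builds the standard codon table and compares the translations of two
sequences; the short sequences used are hypothetical and are not taken from the
sequence listing.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a coding sequence, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[i:i + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


def encode_same_protein(seq_a, seq_b):
    """True when both sequences translate to the same amino acid sequence."""
    return translate(seq_a) == translate(seq_b)


# Hypothetical sequences: the first two differ only at synonymous positions.
first = "ATGGATAAATAA"
second = "ATGGACAAGTGA"
third = "ATGGAAAAATAA"
synonymous = encode_same_protein(first, second)
different = encode_same_protein(first, third)
```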

5.18.3. PLACEMENT OF OBESITY RELATED GENES IN EXPRESSION VECTORS

The nucleotide sequence coding for SEQ ID NO: 8 or a functionally active
fragment or other derivative thereof can be inserted into an appropriate expression
vector, i.e., a vector that contains the necessary elements for the transcription and
translation of the inserted protein-coding sequence. The necessary transcriptional and
translational signals can also be supplied by the native SEQ ID NO: 8 gene and/or its
flanking regions. A variety of host-vector systems may be utilized to express the
protein-coding sequence. These include, but are not limited to, mammalian cell systems
infected with virus (e.g., vaccinia virus, adenovirus, etc.); insect cell systems infected
with virus (e.g., baculovirus); microorganisms such as yeast containing yeast vectors, or
bacteria transformed with bacteriophage DNA, plasmid DNA, or cosmid DNA.

The expression elements of vectors vary in their strengths and specificities.
Depending on the host-vector system utilized, any one of a number of suitable
transcription and translation elements may be used. In specific embodiments, the gene
encoded by SEQ ID NO: 2 and SEQ ID NO: 3 is expressed, or a sequence encoding a
functionally active portion of human SEQ ID NO: 8 encoded by one of these genes is
expressed. In yet another embodiment, a fragment of SEQ ID NO: 8 comprising a
domain of the particular protein is expressed.

Any of the methods previously described for the insertion of DNA fragments into
a vector may be used to construct expression vectors containing a chimeric gene
consisting of appropriate transcriptional/translational control signals and the protein
coding sequences. These methods may include in vitro recombinant DNA and synthetic
techniques and in vivo recombinants (genetic recombination). Expression of a nucleic acid
sequence encoding SEQ ID NO: 8 or a peptide fragment thereof may be regulated by a
second nucleic acid sequence so that the SEQ ID NO: 8 or peptide fragment thereof is
expressed in a host transformed with the recombinant DNA molecule. For example,
expression of SEQ ID NO: 8 may be controlled by any promoter/enhancer element
known in the art. In a specific embodiment, the promoter is not a native promoter of the
specific SEQ ID NO: 8-encoding gene. Promoters that may be used to control expression
of SEQ ID NO: 8-encoding genes include, but are not limited to, the SV40 early promoter
region (Bernoist and Chambon, 1981, Nature 290:304-310), the promoter contained in the
3' long terminal repeat of Rous sarcoma virus (Yamamoto et al., 1980, Cell 22:787-797),
the herpes thymidine kinase promoter (Wagner et al., 1981, Proc. Natl. Acad. Sci. U.S.A.
78:1441-1445), the regulatory sequences of the metallothionein gene (Brinster et al.,
1982, Nature 296:39-42); prokaryotic expression vectors such as the β-lactamase
promoter (Villa-Kamaroff et al., 1978, Proc. Natl. Acad. Sci. U.S.A. 75:3727-3731), or
the tac promoter (DeBoer et al., 1983, Proc. Natl. Acad. Sci. U.S.A. 80:21-25); see also
"Useful proteins from recombinant bacteria", Scientific American, 242:74-94 (1980); or
the cauliflower mosaic virus 35S RNA promoter (Gardner et al., 1981, Nucl. Acids Res.
9:2871), and the promoter of the photosynthetic enzyme ribulose biphosphate carboxylase
(Herrera-Estrella et al., 1984, Nature 310:115-120); promoter elements from yeast or
other fungi such as the Gal4 promoter, the ADC (alcohol dehydrogenase) promoter, PGK
(phosphoglycerol kinase) promoter, alkaline phosphatase promoter, and the following
animal transcriptional control regions, which exhibit tissue specificity and have been
utilized in transgenic animals: elastase I gene control region active in pancreatic acinar
cells (Swift et al., 1984, Cell 38:639-646; Ornitz et al., 1986, Cold Spring Harbor Symp.
Quant. Biol. 50:399-409; MacDonald, 1987, Hepatology 7:425-515); insulin gene control
region active in pancreatic beta cells (Hanahan, 1985, Nature 315:115-122),
immunoglobulin gene control region active in lymphoid cells (Grosschedl et al., 1984,
Cell 38:647-658; Adames et al., 1985, Nature 318:533-538; Alexander et al., 1987,
Mol. Cell. Biol. 7:1436-1444), mouse mammary tumor virus control region active in
testicular, breast, lymphoid and mast cells (Leder et al., 1986, Cell 45:485-495), albumin
gene control region active in liver (Pinkert et al., 1987, Genes and Devel. 1:268-276),
alpha-fetoprotein gene control region active in liver (Krumlauf et al., 1985, Mol. Cell.
Biol. 5:1639-1648; Hammer et al., 1987, Science 235:53-58); alpha 1-antitrypsin gene
control region active in the liver (Kelsey et al., 1987, Genes and Devel. 1:161-171),
beta-globin gene control region active in myeloid cells (Mogram et al., 1985, Nature
315:338-340; Kollias et al., 1986, Cell 46:89-94); myelin basic protein gene control region
active in oligodendrocyte cells in the brain (Readhead et al., 1987, Cell 48:703-712);
myosin light chain-2 gene control region active in skeletal muscle (Sani, 1985, Nature
314:283-286), and gonadotropic releasing hormone gene control region active in the
hypothalamus (Mason et al., 1986, Science 234:1372-1378).

In a specific embodiment, a vector is used that comprises a promoter operably
linked to a SEQ ID NO: 8-encoding nucleic acid, one or more origins of replication, and,
optionally, one or more selectable markers (e.g., an antibiotic resistance gene). In a
specific embodiment, an expression construct is made by subcloning the coding sequence
from a SEQ ID NO: 8 encoding gene into the EcoRI restriction site of each of the three
pGEX vectors (Glutathione S-Transferase expression vectors; Smith and Johnson, 1988,
Gene 7: 31-40). This allows for the expression of the SEQ ID NO: 8 product from the
subclone in the correct reading frame.
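
The reading-frame point can be illustrated computationally. The following is a minimal Python sketch, not part of the disclosed method, that translates a DNA sequence in each of the three forward frames and reports which frames contain no internal stop codon. The function names and the example sequences are hypothetical and chosen only for illustration; the codon table is the standard genetic code.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate dna starting at offset frame (0, 1 or 2); '*' marks a stop."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    # Codons containing characters other than T, C, A, G become 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def open_frames(dna):
    """Return the frame offsets whose translation has no internal stop codon."""
    frames = []
    for frame in range(3):
        protein = translate(dna, frame)
        # A trailing stop is allowed; a stop anywhere else closes the frame.
        if "*" not in protein.rstrip("*"):
            frames.append(frame)
    return frames


if __name__ == "__main__":
    insert = "ATGTAAGCC"  # hypothetical insert sequence
    for frame in range(3):
        print(frame, translate(insert, frame))
    print("frames without an internal stop:", open_frames(insert))
```

In this toy example, frame 0 is interrupted by a stop codon while frames 1 and 2 read through, which mirrors why a set of vectors offering all three frames is convenient when the frame of the insert relative to the fusion partner is not fixed in advance.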

Expression vectors containing SEQ ID NO: 8-encoding gene inserts can be
identified by three general approaches: (a) nucleic acid hybridization, (b) presence or
absence of marker gene functions, and (c) expression of inserted sequences. In the first
approach, the presence of a SEQ ID NO: 8-encoding gene inserted in an expression
vector can be detected by nucleic acid hybridization using probes comprising sequences
that are homologous to an inserted SEQ ID NO: 8-encoding gene. In the second
approach, the recombinant vector/host system can be identified and selected based upon
the presence or absence of certain "marker" gene functions (e.g., thymidine kinase
activity, resistance to antibiotics, transformation phenotype, occlusion body formation in
baculovirus, etc.) caused by the insertion of a SEQ ID NO: 8 encoding gene in the vector.
For example, if the SEQ ID NO: 8-encoding gene is inserted within the marker gene
sequence of the vector, recombinants containing the insert can be identified by the
absence of the marker gene function. In the third approach, recombinant expression
vectors can be identified by assaying the specific SEQ ID NO: 8 product expressed by the
recombinant. Such assays can be based, for example, on the physical or functional
properties of SEQ ID NO: 8 in in vitro assay systems, e.g., kinase activity, binding with
antibodies directed to SEQ ID NO: 8, or inhibition of cell function(s) performed,
facilitated or affected by SEQ ID NO: 8.

Once a particular recombinant DNA molecule is identified and isolated, several
methods known in the art may be used to propagate it. Once a suitable host system and
growth conditions are established, recombinant expression vectors can be propagated and
prepared in quantity. As previously explained, the expression vectors that can be used
include, but are not limited to, the following vectors or their derivatives: human or animal
viruses such as vaccinia virus or adenovirus; insect viruses such as baculovirus; yeast
vectors; bacteriophage vectors (e.g., lambda), and plasmid and cosmid DNA vectors.

In addition, a host cell strain may be chosen that modulates the expression of the
inserted sequences, or modifies and processes the gene product in the specific fashion
desired. Expression from certain promoters can be elevated in the presence of certain
inducers; thus, expression of the genetically engineered SEQ ID NO: 8 may be controlled.
Furthermore, different host cells have characteristic and specific mechanisms for the
translational and post-translational processing and modification (e.g., glycosylation,
phosphorylation) of proteins. Appropriate cell lines or host systems can be chosen to
ensure the desired modification and processing of the foreign protein expressed. For
example, expression in a bacterial system can be used to produce an unglycosylated core
protein product. Expression in yeast will produce a glycosylated product. Expression in
mammalian cells can be used to ensure native glycosylation of a heterologous protein.
Furthermore, different vector/host expression systems may affect processing reactions to
different degrees.

In other specific embodiments, the SEQ ID NO: 8, or fragment or derivative
thereof, may be expressed as a fusion, or chimeric protein product, comprising the
protein, fragment or derivative joined via a peptide bond to a protein sequence derived
from a different protein. Such a chimeric product can be made by ligating the appropriate
nucleic acid sequences encoding the desired amino acid sequences to each other by
methods known in the art, in the proper coding frame, and expressing the chimeric
product by methods commonly known in the art. In one embodiment, therefore, the
invention includes an isolated nucleic acid comprising a sequence of at least 10
nucleotides encoding a chimeric SEQ ID NO: 8, wherein the chimeric SEQ ID NO: 8
displays at least one of the functional activities of the wild-type SEQ ID NO: 8, and at
least one non-SEQ ID NO: 8 functional activity. Alternatively, such a chimeric product
may be made by protein synthetic techniques, e.g., by use of a peptide synthesizer.

A person of skill in the art will appreciate that cDNA, genomic, and synthesized
sequences can be cloned and expressed. One way to accomplish such expression is by
transferring a SEQ ID NO: 8 encoding gene or fragment thereof to cells in tissue culture.
The expression of the transferred gene may be controlled by its native promoter, or can be
controlled by a non-native promoter. In addition to transferring a nucleic acid comprising
a nucleic acid sequence encoding an entire SEQ ID NO: 8 (i.e., equivalent to the wild
type), the transferred nucleic acids can encode a functional portion of SEQ ID NO: 8, or a
protein having at least 60% sequence identity to SEQ ID NO: 8 disclosed herein, as
compared over the length of SEQ ID NO: 8, or a polypeptide having at least 60% sequence
similarity to a SEQ ID NO: 8 fragment, as compared over the length of the SEQ ID NO: 8
fragment. Introduction of the nucleic acid into the cell is accomplished by such methods
as electroporation, lipofection, calcium phosphate mediated transfection, or viral
infection. Usually, the method of transfer includes the transfer of a selectable marker to
the cells. The cells are then placed under selection to isolate those cells that have taken
up and are expressing the transferred gene. The expressed SEQ ID NO: 8 or fragments
thereof are isolated and purified as described below. SEQ ID NO: 12, SEQ ID NO: 16,
and SEQ ID NO: 20 may be manipulated in a similar fashion.

5.18.4. PURIFICATION OF OBESITY RELATED GENE PRODUCTS

In particular aspects, the invention provides amino acid sequences of SEQ ID NO:
8, and fragments and derivatives thereof that comprise an antigenic determinant (i.e., can
be recognized by an antibody) or which are otherwise functionally active, as well as
nucleic acid sequences encoding the foregoing. A functionally active SEQ ID NO: 8
material as used herein refers to that material displaying one or more known functional
activities associated with a full-length (wild-type) SEQ ID NO: 8.

In specific embodiments, the invention provides fragments of SEQ ID NO: 8
consisting of at least 6 amino acids, at least 10 amino acids, or at least 50 amino acids. In
other embodiments, the proteins comprise or consist essentially of a functional domain of
SEQ ID NO: 8. Nucleic acids encoding the foregoing are also provided.




Once a recombinant that expresses the SEQ ID NO: 8-encoding gene sequence, or
part thereof, is identified, the resulting product can be analyzed. This analysis is achieved
by assays based on the physical or functional properties of the product, including
radioactive labeling of the product followed by analysis by gel electrophoresis,
immunoassay, etc. Once SEQ ID NO: 8, or a fragment thereof, is identified, it may be
isolated and purified by standard methods including chromatography (e.g., ion exchange,
affinity, and sizing column chromatography), centrifugation, differential solubility, or by
any other standard technique for the purification of proteins. The functional properties
may be evaluated using any suitable assay.

Alternatively, once SEQ ID NO: 8 produced by a recombinant is identified, the
amino acid sequence of the protein can be deduced from the nucleotide sequence of the
chimeric gene contained in the recombinant. As a result, the protein can be synthesized
by standard chemical methods known in the art (e.g., see Hunkapiller et al., 1984, Nature
310:105-111).

In another alternate embodiment, native SEQ ID NO: 8 protein is purified from
natural sources, by standard methods such as those described above (e.g., immunoaffinity
purification).

In a specific embodiment of the present invention, SEQ ID NO: 8 proteins, whether
produced by recombinant DNA techniques or by chemical synthetic methods or by
purification of native proteins, include but are not limited to those containing, as a
primary amino acid sequence, all or part of the amino acid sequence substantially as
depicted in Fig. 31, as well as fragments and other derivatives thereof, including proteins
homologous thereto. SEQ ID NO: 12, SEQ ID NO: 16, and SEQ ID NO: 20 may be
purified in a similar fashion.

One embodiment of the present invention provides a purified protein comprising
the amino acid sequence of SEQ ID NO: 8. Another embodiment of the present invention
provides a purified protein encoded by a nucleic acid hybridizable under conditions of
low stringency (see Section 5.18.2.1), high stringency (see Section 5.18.2.2), or moderate
stringency (see Section 5.18.2.3) to a DNA having a sequence consisting of the coding
region of SEQ ID NO: 2. Still another embodiment of the present invention provides a
purified protein comprising an amino acid sequence that has at least 60% identity to the
amino acid sequence set forth in SEQ ID NO: 8, in which percentage identity is
determined over an amino acid sequence of identical size as SEQ ID NO: 8. Yet another
embodiment of the present invention provides a purified protein comprising an amino
acid sequence that has at least 90% identity to the amino acid sequence set forth in SEQ
ID NO: 8, in which percentage identity is determined over an amino acid sequence of
identical size as SEQ ID NO: 8. Some embodiments of the present invention provide an
isolated nucleic acid comprising the nucleotide sequence of SEQ ID NO: 2, a coding
region of SEQ ID NO: 2, SEQ ID NO: 3, a coding region of SEQ ID NO: 3, or the
complement of any of the foregoing. In some embodiments, the isolated nucleic acid is a
DNA. Some embodiments of the present invention provide an isolated nucleic acid
comprising a nucleotide sequence encoding any protein described in this
section, or the complement thereof.
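
As an illustration of a percentage identity "determined over an amino acid sequence of identical size", the following minimal Python sketch counts position-by-position matches between two equal-length sequences. It assumes the two sequences are already of identical length and in register (no alignment or gap handling is attempted); the function names and the example peptides are hypothetical.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be of identical size")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100.0 * matches / len(seq_a)


def meets_threshold(candidate, reference, threshold=60.0):
    """True if candidate has at least `threshold` percent identity to reference."""
    return percent_identity(candidate, reference) >= threshold


if __name__ == "__main__":
    reference = "MKTAYIAKQR"  # hypothetical peptide, for illustration only
    candidate = "MKTAYIAAAA"  # differs from the reference at the last three positions
    print(percent_identity(candidate, reference))       # 7 of 10 positions match
    print(meets_threshold(candidate, reference, 60.0))  # passes the 60% threshold
    print(meets_threshold(candidate, reference, 90.0))  # fails the 90% threshold
```

Sequences of unequal length would first need to be aligned by a standard alignment program; that step is outside the scope of this sketch.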

5.18.5. STRUCTURE OF OBESITY RELATED GENES AND ENCODED PROTEINS

The structure of the obesity related genes of the present invention and the obesity
related protein products of the present invention can be analyzed by various methods
known in the art, as described in this section.

5.18.5.1. GENETIC ANALYSIS 
The cloned DNA or cDNA corresponding to an obesity related gene can be
analyzed by methods including, but not limited to, Southern hybridization (Southern,
1975, J. Mol. Biol. 98: 503-517), Northern hybridization (see e.g., Freeman et al., 1983,
Proc. Natl. Acad. Sci. U.S.A. 80: 4094-4098), restriction endonuclease mapping
(Maniatis, 1982, Molecular Cloning, A Laboratory Manual, Cold Spring Harbor, New York),
and DNA sequence analysis. Polymerase chain reaction (PCR; U.S. Patent Nos. 4,683,202,
4,683,195 and 4,889,818; Gyllenstein et al., 1988, Proc. Natl. Acad. Sci. U.S.A. 85:
7652-7656; Ochman et al., 1988, Genetics 120:621-623; Loh et al., 1989, Science 243:
217-220) followed by Southern hybridization with a probe specific to one of the obesity
related genes can allow the detection of that particular obesity related gene in DNA from
various cell types from various vertebrate sources. Methods of amplification other than
PCR are commonly known and can also be employed. In one embodiment, Southern
hybridization can be used to determine the genetic linkage of a particular obesity related
gene. Northern hybridization analysis can be used to determine the expression of a
particular obesity related gene. Various cell types, at various states of development or
activity, can be tested for expression of a particular obesity related gene. In one preferred
embodiment, screening arrays comprising probes homologous to the exons of particular
obesity related genes are used to determine the state of expression of these genes, or
specific exons of these genes, in various cell types, under particular environmental or
perturbance conditions, or in various vertebrates. The stringency of the hybridization
conditions for both Southern and Northern hybridization can be manipulated to ensure
detection of nucleic acids with the desired degree of relatedness to the specific probe
used. Modifications of these methods and other methods commonly known in the art can
be used.

Restriction endonuclease mapping can be used to roughly determine the genetic
structure of an obesity related gene. Restriction maps derived by restriction endonuclease
cleavage can be confirmed by DNA sequence analysis. The genetic structure of an
obesity related gene can also be determined using scanning oligonucleotide arrays,
wherein the expression of one exon is correlated with the expression of a plurality of
neighboring exons, such that the correlation indicates the correlated exons are contained
within the same gene. The structure so determined can be confirmed by PCR.
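
The exon-correlation idea described above can be sketched in a few lines of Python. This is an illustrative simplification under stated assumptions, not the procedure of the invention: each exon is represented by a hypothetical vector of expression measurements across the same set of conditions, only immediately adjacent exons are compared, and the Pearson correlation cutoff of 0.8 is an arbitrary illustrative value.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no correlation information
    return cov / sqrt(var_x * var_y)


def group_exons(profiles, threshold=0.8):
    """Group exons (given in genomic order) into putative genes.

    profiles[i] is the expression of exon i across a common set of conditions.
    Adjacent exons whose profiles correlate at or above `threshold` are placed
    in the same group. Returns a list of lists of exon indices.
    """
    if not profiles:
        return []
    groups = [[0]]
    for i in range(1, len(profiles)):
        if pearson(profiles[i - 1], profiles[i]) >= threshold:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


if __name__ == "__main__":
    # Hypothetical expression values for five neighboring exons, four conditions.
    profiles = [
        [1.0, 2.0, 3.0, 4.0],
        [2.0, 4.0, 6.0, 8.0],
        [1.1, 2.1, 2.9, 4.2],
        [4.0, 3.0, 2.0, 1.0],
        [8.0, 6.0, 4.0, 2.0],
    ]
    print(group_exons(profiles))
```

Comparing each exon against several neighbors rather than only the preceding one, as the text contemplates, would make the grouping more robust to a single noisy probe.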

DNA sequence analysis can be performed by any techniques known in the art,
including but not limited to the method of Maxam and Gilbert, 1980, Meth. Enzymol. 65:
499-560, the Sanger dideoxy method (Sanger et al., 1977, Proc. Natl. Acad. Sci. U.S.A.
74: 5463), the use of T7 DNA polymerase (Tabor & Richardson, U.S. Patent No.
4,795,699), or use of an automated DNA Sequenator (e.g., Applied Biosystems, Foster
City, CA). The sequencing method may use radioactive or fluorescent labels.

5.18.5.2. PROTEIN ANALYSIS 

The amino acid sequence of a particular obesity related gene product can be
derived by deduction from the DNA sequence, or alternatively, by direct sequencing of
the protein, e.g., with an automated amino acid sequencer. The protein sequence of an
obesity related gene product can be characterized by a hydrophilicity analysis (Hopp and
Woods, 1981, Proc. Natl. Acad. Sci. U.S.A. 78: 3824). A hydrophilicity profile is used to
identify the hydrophobic and hydrophilic regions of an obesity related gene product and
the corresponding regions of the gene sequence that encode such regions.
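
A hydrophilicity profile of this kind can be sketched as a sliding-window average over per-residue scores. The minimal Python example below is illustrative only. The per-residue values are those commonly tabulated for the Hopp and Woods scale and should be checked against the cited reference before being relied upon; the window length of 6 and the function names are assumptions made for this sketch.

```python
# Per-residue hydrophilicity values as commonly tabulated for the Hopp-Woods
# scale (an assumption of this sketch; verify against the cited reference).
HOPP_WOODS = {
    "R": 3.0, "D": 3.0, "E": 3.0, "K": 3.0, "S": 0.3, "N": 0.2, "Q": 0.2,
    "G": 0.0, "P": 0.0, "T": -0.4, "A": -0.5, "H": -0.5, "C": -1.0,
    "M": -1.3, "V": -1.5, "I": -1.8, "L": -1.8, "Y": -2.3, "F": -2.5,
    "W": -3.4,
}


def hydrophilicity_profile(protein, window=6):
    """Mean hydrophilicity of every window of `window` consecutive residues."""
    scores = [HOPP_WOODS[aa] for aa in protein.upper()]
    if window < 1 or window > len(scores):
        raise ValueError("window must be between 1 and the sequence length")
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def most_hydrophilic_window(protein, window=6):
    """Return (start index, subsequence) of the highest-scoring window."""
    profile = hydrophilicity_profile(protein, window)
    start = max(range(len(profile)), key=profile.__getitem__)
    return start, protein[start:start + window]


if __name__ == "__main__":
    peptide = "LLLLLLKKKKKK"  # hypothetical sequence, for illustration only
    print(hydrophilicity_profile(peptide))
    print(most_hydrophilic_window(peptide))
```

Windows with high averages mark candidate hydrophilic (often surface-exposed) regions, and windows with strongly negative averages mark hydrophobic regions.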

Secondary structural analysis (Chou and Fasman, 1974, Biochemistry 13: 222)
can also be done, to identify regions of particular obesity related gene products that
assume specific secondary structures, such as α-helices and β-pleated sheets.
Manipulation, translation, secondary structure prediction, open reading frame prediction
and plotting, as well as determination of sequence homologies, can also be accomplished
using computer software programs and nucleotide and protein sequence databases
available in the art. Protein and/or nucleotide sequence homologies to known proteins or
DNA sequences can be used to deduce the likely function of a particular TCAP, or
domains thereof.
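
Open reading frame prediction of the kind mentioned above can be sketched as a scan for an ATG start codon followed, in the same frame, by a stop codon. The Python example below is a minimal illustration that searches only the three forward-strand frames; the function name, the `min_codons` parameter, and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Find forward-strand open reading frames.

    Returns a sorted list of (start, end, frame) tuples, where dna[start:end]
    begins with ATG and ends with the first in-frame stop codon. `min_codons`
    is the minimum number of codons before the stop, counting the ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens a candidate reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # resume the search after this stop codon
    return sorted(orfs)


if __name__ == "__main__":
    sequence = "CCATGAAATTTTAGGG"  # hypothetical sequence, for illustration only
    for start, end, frame in find_orfs(sequence):
        print(frame, start, end, sequence[start:end])
```

A fuller analysis would also scan the reverse complement and would typically apply a much larger minimum length.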

Other methods of structural analysis can also be employed. These include but are
not limited to X-ray crystallography (Engstom, 1974, Biochem. Exp. Biol. 11: 7-13) and
computer modeling (Fletterick and Zoller (eds.), Computer Graphics and Molecular
Modeling, in Current Communications in Molecular Biology, Cold Spring Harbor
Laboratory, Cold Spring Harbor, New York (1986)).


In addition to determinations of obesity related gene product protein structure, the
invention provides methods of identifying a molecule that specifically binds to a ligand
selected from the group consisting of an obesity related gene product, an obesity related
gene product fragment, a domain of an obesity related protein, and a nucleic acid
encoding the obesity related protein or fragment thereof, comprising (a) contacting the
ligand with a plurality of test molecules under conditions conducive to binding between
the ligand and the molecules; and (b) identifying a molecule within the plurality that
specifically binds to the ligand.



5.18.6. OBESITY RELATED GENE PRODUCT ANTIBODIES 

In a specific embodiment, the modulator of an obesity related gene disclosed in
Section 6.7.5 is an antibody that specifically binds (i.e., is not competed off of the obesity
related gene product by a non-specific protein such as bovine serum albumin) the obesity
related gene product or an active fragment or analog thereof and inhibits the function of
the obesity related gene product.

Antibodies of the invention include, but are not limited to, monoclonal antibodies,
multispecific antibodies, human antibodies, humanized antibodies, chimeric antibodies,
single-chain Fvs (scFv), single chain antibodies, Fab fragments, F(ab') fragments,
disulfide-linked Fvs (sdFv), and anti-idiotypic (anti-Id) antibodies (including, e.g., anti-Id
antibodies to antibodies of the invention), and epitope-binding fragments of any of the
above. In particular, antibodies of the present invention include immunoglobulin
molecules and immunologically active portions of immunoglobulin molecules, i.e.,
molecules that contain an antigen binding site that immunospecifically binds to
osteopontin. The immunoglobulin molecules of the invention can be of any type (e.g.,
IgG, IgE, IgM, IgD, IgA and IgY), class (e.g., IgG1, IgG2, IgG3, IgG4, IgA1 and IgA2) or
subclass of immunoglobulin molecule.

The antibodies of the invention may be from any animal origin including birds
and mammals (e.g., human, murine, donkey, sheep, rabbit, goat, guinea pig, camel, horse,
or chicken). Preferably, the antibodies of the invention are human or humanized
monoclonal antibodies. As used herein, "human" antibodies include antibodies having
the amino acid sequence of a human immunoglobulin and include antibodies isolated

The antibodies of the present invention may be monospecific, bispecific,
trispecific or of greater multispecificity. Multispecific antibodies may be specific for
different epitopes of an osteopontin polypeptide or may be specific for both an
osteopontin polypeptide as well as for a heterologous epitope, such as a heterologous
polypeptide or solid support material. See, e.g., PCT publications WO 93/17715; WO
92/08802; WO 91/00360; WO 92/05793; Tutt, et al., J. Immunol. 147:60-69 (1991); U.S.
Patent Nos. 4,474,893; 4,714,681; 4,925,648; 5,573,920; 5,601,819; Kostelny et al., J.
Immunol. 148:1547-1553 (1992).

The antibodies of the invention include derivatives that are modified, e.g., by the
covalent attachment of any type of molecule to the antibody such that covalent
attachment. For example, but not by way of limitation, the antibody derivatives include
antibodies that have been modified, e.g., by glycosylation, acetylation, pegylation,
phosphorylation, amidation, derivatization by known protecting/blocking groups,
proteolytic cleavage, linkage to a cellular ligand or other protein, etc. Any of numerous
chemical modifications may be carried out by known techniques, including, but not
limited to, specific chemical cleavage, acetylation, formylation, metabolic synthesis of
tunicamycin, etc. Additionally, the derivative may contain one or more non-classical
amino acids. For example, antibodies of the present invention may be recombinantly
fused or conjugated to molecules useful as labels in detection assays and effector
molecules such as heterologous polypeptides, drugs, radionuclides, or toxins. See, e.g.,
PCT publications WO 92/08495; WO 91/14438; WO 89/12624; U.S. Patent No.
5,314,995; and EP 396,387.




The present invention encompasses antibodies or fragments thereof recombinantly
fused or chemically conjugated (including both covalent and non-covalent
conjugations) to a heterologous polypeptide (or portion thereof, preferably at least 10, at
least 20, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, at least 90 or
at least 100 amino acids of the polypeptide) to generate fusion proteins. The fusion does
not necessarily need to be direct, but may occur through linker sequences. For example,
antibodies may be used to target heterologous polypeptides to particular cell types either
in vitro or in vivo, by fusing or conjugating the antibodies to antibodies specific for
particular cell surface receptors. Antibodies fused or conjugated to heterologous
polypeptides may also be used in in vitro immunoassays and purification methods using
methods known in the art. See e.g., Harbor et al., supra, and PCT publication WO
93/21232; EP 439,095; Naramura et al., Immunol. Lett. 39:91-99 (1994); U.S. Patent
5,474,981; Gillies et al., PNAS 89:1428-1432 (1992); Fell et al., J. Immunol. 146:2446-
2452 (1991), which are incorporated by reference in their entireties.


The present invention further includes compositions comprising heterologous
polypeptides fused or conjugated to antibody fragments. For example, the heterologous
polypeptides may be fused or conjugated to a Fab fragment, Fd fragment, Fv fragment,
F(ab)2 fragment, or portion thereof. Methods for fusing or conjugating polypeptides to
antibody portions are known in the art. See, e.g., U.S. Patent Nos. 5,336,603; 5,622,929;
5,359,046; 5,349,053; 5,447,851; 5,112,946; EP 307,434; EP 367,166; PCT publications
WO 96/04388; WO 91/06570; Ashkenazi et al., Proc. Natl. Acad. Sci. USA 88:10535-
10539 (1991); Zheng et al., J. Immunol. 154:5590-5600 (1995); and Vil et al., Proc. Natl.
Acad. Sci. USA 89:11337-11341 (1992), which are hereby incorporated by reference in
their entireties.

Additional fusion proteins of the invention may be generated through the
techniques of gene-shuffling, motif-shuffling, exon-shuffling, and/or codon-shuffling
(collectively referred to as "DNA shuffling"). DNA shuffling may be employed to alter
the activities of antibodies of the invention or fragments thereof (e.g., antibodies or
fragments thereof with higher affinities and lower dissociation rates). See, generally,
U.S. Patent Nos. 5,605,793; 5,811,238; 5,830,721; 5,834,252; and 5,837,458, and Patten
et al., 1997, Curr. Opinion Biotechnol. 8:724-33; Harayama, 1998, Trends Biotechnol.
16(2):76-82; Hansson et al., 1999, J. Mol. Biol. 287:265-76; and Lorenzo and Blasco,
1998, Biotechniques 24(2):308-13, which are hereby incorporated by reference in their
entireties.


In one embodiment, antibodies or fragments thereof, or the encoded antibodies or
fragments thereof, are altered by subjecting them to random mutagenesis by error-prone
PCR, random nucleotide insertion or other methods, prior to recombination. In another
embodiment, one or more portions of a polynucleotide encoding an antibody or antibody
fragment, which portions immunospecifically bind to an osteopontin antigen, may be
recombined with one or more components, motifs, sections, parts, domains, fragments,
etc. of one or more heterologous molecules.

Moreover, the antibodies of the present invention or fragments thereof can be
fused to marker sequences, such as a peptide, to facilitate purification. In preferred
embodiments, the marker amino acid sequence is a hexa-histidine peptide, such as the tag
provided in a pQE vector (QIAGEN, Inc., 9259 Eton Avenue, Chatsworth, CA, 91311),
among others, many of which are commercially available. As described in Gentz et al.,
1989, Proc. Natl. Acad. Sci. USA 86:821-824, for instance, hexa-histidine provides for
convenient purification of the fusion protein. Other peptide tags useful for purification
include, but are not limited to, the hemagglutinin "HA" tag, which corresponds to an
epitope derived from the influenza hemagglutinin protein (Wilson et al., 1984, Cell
37:767) and the "flag" tag.


An antibody or fragment thereof may be conjugated to a therapeutic moiety such
as a cytotoxin, e.g., a cytostatic or cytocidal agent, a therapeutic agent or a radioactive
metal ion, e.g., alpha-emitters. A cytotoxin or cytotoxic agent includes any agent that is
detrimental to cells. Examples include paclitaxol, cytochalasin B, gramicidin D, ethidium
bromide, emetine, mitomycin, etoposide, tenoposide, vincristine, vinblastine, colchicin,
doxorubicin, daunorubicin, dihydroxy anthracin dione, mitoxantrone, mithramycin,
actinomycin D, 1-dehydrotestosterone, glucocorticoids, procaine, tetracaine, lidocaine,
propranolol, and puromycin and analogs or homologs thereof. Therapeutic agents
include, but are not limited to, antimetabolites (e.g., methotrexate, 6-mercaptopurine, 6-
thioguanine, cytarabine, 5-fluorouracil decarbazine), alkylating agents (e.g.,
mechlorethamine, thioepa chlorambucil, melphalan, carmustine (BSNU) and lomustine
(CCNU), cyclothosphamide, busulfan, dibromomannitol, streptozotocin, mitomycin C,
and cisdichlorodiamine platinum (II) (DDP) cisplatin), anthracyclines (e.g., daunorubicin
(formerly daunomycin) and doxorubicin), antibiotics (e.g., dactinomycin (formerly
actinomycin), bleomycin, mithramycin, and anthramycin (AMC)), and anti-mitotic agents
(e.g., vincristine and vinblastine).


Further, an antibody or fragment thereof may be conjugated to a therapeutic agent
or drug moiety that modifies a given biological response. Therapeutic agents or drug
moieties are not to be construed as limited to classical chemical therapeutic agents. For
example, the drug moiety may be a protein or polypeptide possessing a desired biological
activity. Such proteins may include, for example, a toxin such as abrin, ricin A,
pseudomonas exotoxin, or diphtheria toxin; a protein such as tumor necrosis factor,
interferon, β-interferon, nerve growth factor, platelet derived growth factor, tissue
plasminogen activator, an apoptotic agent, e.g., TNF-α, TNF-β, AIM I (see, International
Publication No. WO 97/33899), AIM II (see, International Publication No. WO
97/34911), Fas Ligand (Takahashi et al., 1994, J. Immunol., 6:1567-1574), and VEGI
(see, International Publication No. WO 99/23105), a thrombotic agent or an anti-
angiogenic agent, e.g., angiostatin or endostatin; or, a biological response modifier such
as, for example, a lymphokine (e.g., interleukin-1 ("IL-1"), interleukin-2 ("IL-2"),
interleukin-6 ("IL-6"), granulocyte macrophage colony stimulating factor ("GM-CSF"),
and granulocyte colony stimulating factor ("G-CSF")), or a growth factor (e.g., growth
hormone ("GH")).


Techniques for conjugating such therapeutic moiety to antibodies are well known,
see, Arnon et al., "Monoclonal Antibodies For Immunotargeting Of Drugs In Cancer
Therapy", in Monoclonal Antibodies And Cancer Therapy, Reisfeld et al. (eds.), pp. 243-
56 (Alan R. Liss, Inc. 1985); Hellstrom et al., "Antibodies For Drug Delivery", in
Controlled Drug Delivery (2nd Ed.), Robinson et al. (eds.), pp. 623-53 (Marcel Dekker,
Inc. 1987); Thorpe, "Antibody Carriers Of Cytotoxic Agents In Cancer Therapy: A
Review", in Monoclonal Antibodies '84: Biological And Clinical Applications, Pinchera
et al. (eds.), pp. 475-506 (1985); "Analysis, Results, And Future Prospective Of The
Therapeutic Use Of Radiolabeled Antibody In Cancer Therapy", in Monoclonal
Antibodies For Cancer Detection And Therapy, Baldwin et al. (eds.), pp. 303-16
(Academic Press 1985), and Thorpe et al., 1982, Immunol. Rev. 62:119-58.

An antibody or fragment thereof, with or without a therapeutic moiety conjugated
to it, administered alone or in combination with cytotoxic factor(s) and/or cytokine(s) can
be used as a therapeutic.

Alternatively, an antibody can be conjugated to a second antibody to form an 
antibody heteroconjugate as described by Segal in U.S. Patent No. 4,676,980, which is 
incorporated herein by reference in its entirety. 




Antibodies may also be attached to solid supports, which are particularly useful 
for immunoassays or purification of the target antigen. Such solid supports include, but
are not limited to, glass, cellulose, polyacrylamide, nylon, polystyrene, polyvinyl chloride 
or polypropylene. 

Antibodies of the present invention or fragments thereof may be characterized in a
variety of ways. In particular, antibodies of the invention or fragments thereof may be
assayed for the ability to immunospecifically bind to obesity related gene products of the
present invention. Such an assay may be performed in solution (e.g., Houghten, 1992,
Bio/Techniques 13:412-421), on beads (Lam, 1991, Nature 354:82-84), on chips (Fodor,
1993, Nature 364:555-556), on bacteria (U.S. Patent No. 5,223,409), on spores (U.S.
Patent Nos. 5,571,698; 5,403,484; and 5,223,409), on plasmids (Cull et al., 1992, Proc.
Natl. Acad. Sci. USA 89:1865-1869) or on phage (Scott and Smith, 1990, Science
249:386-390; Devlin, 1990, Science 249:404-406; Cwirla et al., 1990, Proc. Natl. Acad.
Sci. USA 87:6378-6382; and Felici, 1991, J. Mol. Biol. 222:301-310) (each of these
references is incorporated herein in its entirety by reference). Antibodies or fragments
thereof that have been identified to immunospecifically bind to the obesity related gene
products of the present invention or a fragment thereof can then be assayed for their
specificity and affinity for the obesity related gene products of the present invention.

The antibodies of the invention or fragments thereof may be assayed for
immunospecific binding to the obesity related gene products of the present invention and
cross-reactivity with other antigens by any method known in the art. Immunoassays that
can be used to analyze immunospecific binding and cross-reactivity include, but are not
limited to, competitive and non-competitive assay systems using techniques such as
western blots, radioimmunoassays, ELISA (enzyme linked immunosorbent assay),
"sandwich" immunoassays, immunoprecipitation assays, precipitin reactions, gel
diffusion precipitin reactions, immunodiffusion assays, agglutination assays,
complement-fixation assays, immunoradiometric assays, fluorescent immunoassays,
protein A immunoassays, to name but a few. Such assays are routine and well known in
the art (see, e.g., Ausubel et al., eds., 1994, Current Protocols in Molecular Biology, Vol.
1, John Wiley & Sons, Inc., New York, which is incorporated by reference herein in its
entirety). Exemplary immunoassays are described briefly below (but are not intended by
way of limitation).




Immunoprecipitation protocols generally comprise lysing a population of cells in a
lysis buffer such as RIPA buffer (1% NP-40 or Triton X-100, 1% sodium deoxycholate,
0.1% SDS, 0.15 M NaCl, 0.01 M sodium phosphate at pH 7.2, 1% Trasylol)
supplemented with protein phosphatase and/or protease inhibitors (e.g., EDTA, PMSF,
aprotinin, sodium vanadate), adding the antibody of interest to the cell lysate, incubating
for a period of time (e.g., one to four hours) at 4° C, adding protein A and/or protein G
sepharose beads to the cell lysate, incubating for about an hour or more at 4° C, washing
the beads in lysis buffer and resuspending the beads in SDS/sample buffer. The ability of
the antibody of interest to immunoprecipitate a particular antigen can be assessed by, e.g.,
western blot analysis. One of skill in the art would be knowledgeable as to the
parameters that can be modified to increase the binding of the antibody to an antigen and
decrease the background (e.g., pre-clearing the cell lysate with sepharose beads). For
further discussion regarding immunoprecipitation protocols see, e.g., Ausubel et al., eds.,
1994, Current Protocols in Molecular Biology, Vol. 1, John Wiley & Sons, Inc., New
York at 10.16.1.

Western blot analysis generally comprises preparing protein samples,
electrophoresis of the protein samples in a polyacrylamide gel (e.g., 8%-20% SDS-
PAGE depending on the molecular weight of the antigen), transferring the protein sample
from the polyacrylamide gel to a membrane such as nitrocellulose, PVDF or nylon,
blocking the membrane in blocking solution (e.g., PBS with 3% BSA or non-fat milk),
washing the membrane in washing buffer (e.g., PBS-Tween 20), incubating the membrane
with primary antibody (the antibody of interest) diluted in blocking buffer, washing the
membrane in washing buffer, incubating the membrane with a secondary antibody (which
recognizes the primary antibody, e.g., an anti-human antibody) conjugated to an
enzymatic substrate (e.g., horseradish peroxidase or alkaline phosphatase) or radioactive
molecule (e.g., ³²P or ¹²⁵I) diluted in blocking buffer, washing the membrane in wash
buffer, and detecting the presence of the antigen. One of skill in the art would be
knowledgeable as to the parameters that can be modified to increase the signal detected
and to reduce the background noise. For further discussion regarding western blot
protocols see, e.g., Ausubel et al., eds., 1994, Current Protocols in Molecular Biology, Vol.
1, John Wiley & Sons, Inc., New York at 10.8.1.

ELISAs comprise preparing antigen, coating the well of a 96 well microtiter plate
with the antigen, adding the antibody of interest conjugated to a detectable compound
such as an enzymatic substrate (e.g., horseradish peroxidase or alkaline phosphatase) to
the well and incubating for a period of time, and detecting the presence of the antigen. In
ELISAs the antibody of interest does not have to be conjugated to a detectable compound;
instead, a second antibody (which recognizes the antibody of interest) conjugated to a
detectable compound may be added to the well. Further, instead of coating the well with
the antigen, the antibody may be coated to the well. In this case, a second antibody
conjugated to a detectable compound may be added following the addition of the antigen
of interest to the coated well. One of skill in the art would be knowledgeable as to the
parameters that can be modified to increase the signal detected as well as other variations
of ELISAs known in the art. For further discussion regarding ELISAs see, e.g., Ausubel
et al., eds., 1994, Current Protocols in Molecular Biology, Vol. 1, John Wiley & Sons,
Inc., New York at 11.2.1.

The binding affinity of an antibody to an antigen and the off-rate of an antibody-
antigen interaction can be determined by competitive binding assays. One example of a
competitive binding assay is a radioimmunoassay comprising the incubation of labeled
antigen (e.g., ³H or ¹²⁵I) with the antibody of interest in the presence of increasing
amounts of unlabeled antigen, and the detection of the antibody bound to the labeled
antigen. The affinity of an antibody of the present invention or a fragment thereof for the
obesity related gene products of the present invention and the binding off-rates can be
determined from the data by Scatchard plot analysis. Competition with a second antibody
can also be determined using radioimmunoassays. In this case, osteopontin is incubated
with an antibody of the present invention or a fragment thereof conjugated to a labeled
compound (e.g., ³H or ¹²⁵I) in the presence of increasing amounts of an unlabeled second
antibody.
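
By way of illustration only, the Scatchard plot analysis mentioned above can be sketched
numerically. The sketch assumes a simple one-site binding model, in which
bound/free = (Bmax - bound)/Kd, so that a straight-line fit of bound/free against bound
has slope -1/Kd and x-intercept Bmax. The function names (one_site_bound, scatchard_fit)
and the example values are hypothetical and are not taken from this disclosure.

```python
def one_site_bound(free, kd, bmax):
    """Bound ligand predicted by a one-site model: B = Bmax * F / (Kd + F)."""
    return [bmax * f / (kd + f) for f in free]


def scatchard_fit(bound, free):
    """Estimate (Kd, Bmax) by a least-squares line through (bound, bound/free).

    Assumes one-site binding, where bound/free = (Bmax - bound) / Kd.
    """
    if len(bound) != len(free) or len(bound) < 2:
        raise ValueError("need at least two paired bound/free measurements")
    x = list(bound)
    y = [b / f for b, f in zip(bound, free)]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    kd = -1.0 / slope          # slope of the Scatchard line is -1/Kd
    bmax = -intercept / slope  # x-intercept of the Scatchard line is Bmax
    return kd, bmax


if __name__ == "__main__":
    # Hypothetical free-ligand concentrations (arbitrary units).
    free_conc = [0.5, 1.0, 2.0, 4.0, 8.0]
    bound_conc = one_site_bound(free_conc, kd=2.0, bmax=10.0)
    print(scatchard_fit(bound_conc, free_conc))
```

Real radioimmunoassay data are noisy and may deviate from a single straight line (for
example, with more than one class of binding site), in which case a nonlinear fit of the
binding isotherm is generally preferred over the linearized plot.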

BIAcore kinetic analysis may also be used to determine the binding on and off
rates of antibodies or fragments thereof to the obesity related gene products of the present
invention. BIAcore kinetic analysis comprises analyzing the binding and dissociation of
osteopontin from chips with immobilized antibodies or fragments thereof on their surface.
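
As a hedged sketch of how on and off rates relate to affinity, the following assumes a
simple 1:1 interaction in which the dissociation phase decays as R(t) = R0 * exp(-k_off * t)
and the equilibrium dissociation constant is KD = k_off / k_on. The function names and
numerical values are hypothetical; this is not the instrument's own fitting procedure.

```python
import math


def dissociation_rate_constant(times, responses):
    """Estimate k_off from dissociation-phase data by a log-linear least-squares fit.

    Assumes single-exponential decay, R(t) = R0 * exp(-k_off * t), with all
    responses strictly positive.
    """
    if len(times) != len(responses) or len(times) < 2:
        raise ValueError("need at least two paired time/response points")
    logs = [math.log(r) for r in responses]
    n = len(times)
    mean_t = sum(times) / n
    mean_l = sum(logs) / n
    stt = sum((t - mean_t) ** 2 for t in times)
    stl = sum((t - mean_t) * (l - mean_l) for t, l in zip(times, logs))
    return -stl / stt  # slope of ln R versus t is -k_off


def equilibrium_dissociation_constant(k_on, k_off):
    """KD = k_off / k_on for a 1:1 binding model."""
    return k_off / k_on


if __name__ == "__main__":
    # Hypothetical dissociation curve sampled every 60 seconds.
    sample_times = [0.0, 60.0, 120.0, 180.0, 240.0]
    sample_responses = [100.0 * math.exp(-1e-3 * t) for t in sample_times]
    k_off = dissociation_rate_constant(sample_times, sample_responses)
    print(equilibrium_dissociation_constant(k_on=1e5, k_off=k_off))
```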

One aspect of the invention provides an antibody that binds to a protein consisting
of the amino acid sequence of SEQ ID NO: 8. In some embodiments, the antibody is
monoclonal. Another aspect of the invention provides a molecule comprising a fragment
of an antibody that binds to a protein consisting of the amino acid sequence of SEQ ID
NO: 8, which fragment binds a protein consisting of the amino acid sequence of SEQ ID
NO: 8.




5.18.7. OBESITY RELATED GENE PRODUCT ANTIBODY PRODUCTION

The antibodies of the invention or fragments thereof can be produced by any
method known in the art for the synthesis of antibodies, in particular, by chemical
synthesis or preferably, by recombinant expression techniques.

Polyclonal antibodies can be produced by various procedures well known in the
art. For example, an obesity related gene product of the present invention, as disclosed in
Section 6.7.5, or an immunogenic or antigenic fragment thereof can be administered to
various host animals including, but not limited to, rabbits, mice, rats, etc. to induce the
production of sera containing polyclonal antibodies specific for the obesity related gene
product. Various adjuvants may be used to increase the immunological response,
depending on the host species, and include but are not limited to, Freund's (complete and
incomplete), mineral gels such as aluminum hydroxide, surface active substances such as
lysolecithin, pluronic polyols, polyanions, peptides, oil emulsions, keyhole limpet
hemocyanins, dinitrophenol, and potentially useful human adjuvants such as BCG (bacille
Calmette-Guerin) and corynebacterium parvum. Such adjuvants are also well known in
the art.

Monoclonal antibodies can be prepared using a wide variety of techniques known
in the art including the use of hybridoma, recombinant, and phage display technologies,
or a combination thereof. For example, monoclonal antibodies can be produced using
hybridoma techniques including those known in the art and taught, for example, in
Harlow et al., Antibodies: A Laboratory Manual, (Cold Spring Harbor Laboratory Press,
2nd ed. 1988); Hammerling, et al., in: Monoclonal Antibodies and T-Cell Hybridomas
563-681 (Elsevier, N.Y., 1981) (said references incorporated by reference in their
entireties). The term "monoclonal antibody" as used herein is not limited to antibodies
produced through hybridoma technology. The term "monoclonal antibody" refers to an
antibody that is derived from a single clone, including any eukaryotic, prokaryotic, or
phage clone, and not the method by which it is produced.

Methods for producing and screening for specific antibodies using hybridoma
technology are routine and well known in the art. Briefly, mice can be immunized with
osteopontin or an immunogenic or antigenic fragment thereof and once an immune
response is detected, e.g., antibodies specific for osteopontin are detected in the mouse
serum, the mouse spleen is harvested and splenocytes isolated. The splenocytes are then
fused by well known techniques to any suitable myeloma cells, for example cells from
cell line SP20 available from the ATCC. Hybridomas are selected and cloned by limited
dilution. The hybridoma clones are then assayed by methods known in the art for cells
that secrete antibodies capable of binding the obesity related gene products of the present
invention. Ascites fluid, which generally contains high levels of antibodies, can be
generated by immunizing mice with positive hybridoma clones.

Accordingly, the present invention provides methods of generating monoclonal
antibodies as well as antibodies produced by the method comprising culturing a
hybridoma cell secreting an antibody of the invention wherein, preferably, the hybridoma
is generated by fusing splenocytes isolated from a mouse immunized with an obesity
related gene product of the present invention or an immunogenic or antigenic fragment
thereof with myeloma cells and then screening the hybridomas resulting from the fusion
for hybridoma clones that secrete an antibody able to bind osteopontin. The hybridomas
may be further screened for secretion of antibodies that inhibit osteopontin function.

Antibody fragments which recognize specific osteopontin epitopes may be
generated by any technique known to those of skill in the art. For example, Fab and
F(ab')2 fragments of the invention may be produced by proteolytic cleavage of
immunoglobulin molecules, using enzymes such as papain (to produce Fab fragments) or
pepsin (to produce F(ab')2 fragments). F(ab')2 fragments contain the variable region, the
light chain constant region and the CH1 domain of the heavy chain. Further, the
antibodies of the present invention can also be generated using various phage display
methods known in the art.

In phage display methods, functional antibody domains are displayed on the
surface of phage particles that carry the polynucleotide sequences encoding them. In
particular, DNA sequences encoding VH and VL domains are amplified from animal
cDNA libraries (e.g., human or murine cDNA libraries of lymphoid tissues). The DNA
encoding the VH and VL domains is recombined together with an scFv linker by PCR
and cloned into a phagemid vector (e.g., pCANTAB 6 or pComb 3 HSS). The vector is
electroporated in E. coli and the E. coli is infected with helper phage. Phage used in these
methods are typically filamentous phage including fd and M13 and the VH and VL
domains are usually recombinantly fused to either the phage gene III or gene VIII. Phage
expressing an antigen binding domain that binds to an antigen of interest can be selected
or identified with antigen, e.g., using labeled antigen or antigen bound or captured to a
solid surface or bead. Examples of phage display methods that can be used to make the
antibodies of the present invention include those disclosed in Brinkman et al., 1995, J.
Immunol. Methods 182:41-50; Ames et al., 1995, J. Immunol. Methods 184:177-186;
Kettleborough et al., 1994, Eur. J. Immunol. 24:952-958; Persic et al., 1997, Gene 187:9-
18; Burton et al., 1994, Advances in Immunology 57:191-280; PCT application No.
PCT/GB91/01134; PCT publications WO 90/02809; WO 91/10737; WO 92/01047; WO
92/18619; WO 93/11236; WO 95/15982; WO 95/20401; WO 97/13844; and U.S. Patent
Nos. 5,698,426; 5,223,409; 5,403,484; 5,580,717; 5,427,908; 5,750,753; 5,821,047;
5,571,698; 5,427,908; 5,516,637; 5,780,225; 5,658,727; 5,733,743 and 5,969,108; each of
which is incorporated herein by reference in its entirety.

As described in the above references, after phage selection, the antibody coding
regions from the phage can be isolated and used to generate whole antibodies, including
human antibodies, or any other desired antigen binding fragment, and expressed in any
desired host, including mammalian cells, insect cells, plant cells, yeast, and bacteria, e.g.,
as described below. Techniques to recombinantly produce Fab, Fab' and F(ab')2
fragments can also be employed using methods known in the art such as those disclosed
in PCT publication WO 92/22324; Mullinax et al., 1992, BioTechniques 12(6):864-869;
Sawai et al., 1995, AJRI 34:26-34; and Better et al., 1988, Science 240:1041-1043
(said references incorporated by reference in their entireties).

To generate whole antibodies, PCR primers including VH or VL nucleotide
sequences, a restriction site, and a flanking sequence to protect the restriction site can be
used to amplify the VH or VL sequences in scFv clones. Utilizing cloning techniques
known to those of skill in the art, the PCR amplified VH domains can be cloned into
vectors expressing a VH constant region, e.g., the human gamma 4 constant region, and
the PCR amplified VL domains can be cloned into vectors expressing a VL constant
region, e.g., human kappa or lambda constant regions. Preferably, the vectors for
expressing the VH or VL domains comprise an EF-1α promoter, a secretion signal, a
cloning site for the variable domain, constant domains, and a selection marker such as
neomycin. The VH and VL domains may also be cloned into one vector expressing the
necessary constant regions. The heavy chain conversion vectors and light chain
conversion vectors are then co-transfected into cell lines to generate stable or transient
cell lines that express full-length antibodies, e.g., IgG, using techniques known to those of
skill in the art.




For some uses, including in vivo use of antibodies in humans and in vitro
detection assays, it may be preferable to use human or chimeric antibodies. Completely
human antibodies are particularly desirable for therapeutic treatment of human subjects.
Human antibodies can be made by a variety of methods known in the art including phage
display methods described above using antibody libraries derived from human
immunoglobulin sequences. See also U.S. Patent Nos. 4,444,887 and 4,716,111; and
PCT publications WO 98/46645, WO 98/50433, WO 98/24893, WO 98/16654, WO
96/34096, WO 96/33735, and WO 91/10741; each of which is incorporated herein by
reference in its entirety.

Human antibodies can also be produced using transgenic mice that are incapable
of expressing functional endogenous immunoglobulins, but which can express human
immunoglobulin genes. For example, the human heavy and light chain immunoglobulin
gene complexes may be introduced randomly or by homologous recombination into
mouse embryonic stem cells. Alternatively, the human variable region, constant region,
and diversity region may be introduced into mouse embryonic stem cells in addition to
the human heavy and light chain genes. The mouse heavy and light chain
immunoglobulin genes may be rendered non-functional separately or simultaneously with
the introduction of human immunoglobulin loci by homologous recombination. In
particular, homozygous deletion of the JH region prevents endogenous antibody
production. The modified embryonic stem cells are expanded and microinjected into
blastocysts to produce chimeric mice. The chimeric mice are then bred to produce
homozygous offspring which express human antibodies. The transgenic mice are
immunized in the normal fashion with a selected antigen, e.g., all or a portion of a
polypeptide of the invention. Monoclonal antibodies directed against the antigen can be
obtained from the immunized, transgenic mice using conventional hybridoma technology.
The human immunoglobulin transgenes harbored by the transgenic mice rearrange during
B cell differentiation, and subsequently undergo class switching and somatic mutation.
Thus, using such a technique, it is possible to produce therapeutically useful IgG, IgA,
IgM and IgE antibodies. For an overview of this technology for producing human
antibodies, see Lonberg and Huszar (1995, Int. Rev. Immunol. 13:65-93). For a detailed
discussion of this technology for producing human antibodies and human monoclonal
antibodies and protocols for producing such antibodies, see, e.g., PCT publications WO
98/24893; WO 96/34096; WO 96/33735; U.S. Patent Nos. 5,413,923; 5,625,126;
5,633,425; 5,569,825; 5,661,016; 5,545,806; 5,814,318; and 5,939,598, which are
incorporated by reference herein in their entirety. In addition, companies such as
Abgenix, Inc. (Fremont, CA) and Genpharm (San Jose, CA) can be engaged to provide
human antibodies directed against a selected antigen using technology similar to that
described above.

A chimeric antibody is a molecule in which different portions of the antibody are
derived from different immunoglobulin molecules such as antibodies having a variable
region derived from a human antibody and a non-human immunoglobulin constant
region. Methods for producing chimeric antibodies are known in the art. See e.g.,
Morrison, 1985, Science 229:1202; Oi et al., 1986, BioTechniques 4:214; Gillies et al.,
1989, J. Immunol. Methods 125:191-202; U.S. Patent Nos. 5,807,715; 4,816,567; and
4,816,397, which are incorporated herein by reference in their entirety. Chimeric antibodies
comprising one or more CDRs from human species and framework regions from a non-
human immunoglobulin molecule can be produced using a variety of techniques known in
the art including, for example, CDR-grafting (EP 239,400; PCT publication WO
91/09967; U.S. Patent Nos. 5,225,539; 5,530,101; and 5,585,089), veneering or
resurfacing (EP 592,106; EP 519,596; Padlan, 1991, Molecular Immunology 28(4/5):489-
498; Studnicka et al., 1994, Protein Engineering 7(6):805-814; Roguska et al., 1994,
PNAS 91:969-973), and chain shuffling (U.S. Patent No. 5,565,332).

Further, the antibodies of the invention can, in turn, be utilized to generate anti-
idiotype antibodies that "mimic" one or more of the obesity related gene products of the
present invention using techniques well known to those skilled in the art. (See, e.g.,
Greenspan & Bona, 1989, FASEB J. 7(5):437-444; and Nissinoff, 1991, J. Immunol.
147(8):2429-2438).



5.18.8. POLYNUCLEOTIDES ENCODING AN OBESITY RELATED GENE
PRODUCT ANTIBODY

The invention provides polynucleotides comprising a nucleotide sequence
encoding an antibody of the invention or a fragment thereof. The invention
encompasses polynucleotides that hybridize under high stringency, intermediate or lower
stringency hybridization conditions, e.g., as defined supra, to polynucleotides that encode
an antibody of the invention.

The polynucleotides may be obtained, and the nucleotide sequence of the
polynucleotides determined, by any method known in the art. Nucleotide sequences
encoding these antibodies can be determined using any nucleic acid sequencing method
known in the art. Such a polynucleotide encoding the antibody may be assembled from
chemically synthesized oligonucleotides (e.g., as described in Kutmeier et al., 1994,
BioTechniques 17:242), which, briefly, involves the synthesis of overlapping
oligonucleotides containing portions of the sequence encoding the antibody, annealing
and ligating of those oligonucleotides, and then amplification of the ligated
oligonucleotides by PCR.
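
The idea of building a coding sequence from overlapping oligonucleotides can be
illustrated with a minimal in silico sketch. It assumes fixed-length oligonucleotides on
alternating strands with a fixed overlap, and it models annealing and ligation as plain
string joining; the function names, oligonucleotide length and overlap are hypothetical
choices and are not taken from the cited reference.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def split_into_overlapping_oligos(sequence, oligo_len=40, overlap=20):
    """Split a sequence into overlapping oligos on alternating strands."""
    if not 0 < overlap < oligo_len:
        raise ValueError("overlap must be positive and shorter than oligo_len")
    step = oligo_len - overlap
    oligos = []
    start = 0
    while True:
        fragment = sequence[start:start + oligo_len]
        # Even-numbered oligos are top strand, odd-numbered are bottom strand.
        if len(oligos) % 2 == 0:
            oligos.append(fragment)
        else:
            oligos.append(reverse_complement(fragment))
        if start + oligo_len >= len(sequence):
            break
        start += step
    return oligos


def assemble(oligos, overlap=20):
    """Rebuild the top-strand sequence, checking each overlap as it is joined."""
    assembled = ""
    for index, oligo in enumerate(oligos):
        fragment = oligo if index % 2 == 0 else reverse_complement(oligo)
        if index == 0:
            assembled = fragment
            continue
        if assembled[-overlap:] != fragment[:overlap]:
            raise ValueError("oligo %d does not overlap its neighbor" % index)
        assembled += fragment[overlap:]
    return assembled
```

For example, splitting a 95-base sequence with the default settings yields four
oligonucleotides, and assemble() returns the original sequence when given them in order.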

Alternatively, a polynucleotide encoding an antibody may be generated from
nucleic acid from a suitable source. If a clone containing a nucleic acid encoding a
particular antibody is not available, but the sequence of the antibody molecule is known, a
nucleic acid encoding the immunoglobulin may be chemically synthesized or obtained
from a suitable source (e.g., an antibody cDNA library, or a cDNA library generated
from, or nucleic acid, preferably poly A+ RNA, isolated from, any tissue or cells
expressing the antibody, such as hybridoma cells selected to express an antibody of the
invention) by PCR amplification using synthetic primers hybridizable to the 3' and 5'
ends of the sequence or by cloning using an oligonucleotide probe specific for the
particular gene sequence to identify, e.g., a cDNA clone from a cDNA library that
encodes the antibody. Amplified nucleic acids generated by PCR may then be cloned
into replicable cloning vectors using any method well known in the art.

Once the nucleotide sequence of the antibody is determined, the nucleotide
sequence of the antibody may be manipulated using methods well known in the art for the
manipulation of nucleotide sequences, e.g., recombinant DNA techniques, site directed
mutagenesis, PCR, etc. (see, for example, the techniques described in Sambrook et al.,
1990, Molecular Cloning, A Laboratory Manual, 2nd Ed., Cold Spring Harbor Laboratory,
Cold Spring Harbor, NY and Ausubel et al., eds., 1998, Current Protocols in Molecular
Biology, John Wiley & Sons, NY, which are both incorporated by reference herein in
their entireties), to generate antibodies having a different amino acid sequence, for
example to create amino acid substitutions, deletions, and/or insertions.

5.18.9. RECOMBINANT EXPRESSION OF AN ANTIBODY TO AN OBESITY
RELATED GENE PRODUCT

Recombinant expression of an antibody of the invention, derivative or analog
thereof (e.g., a heavy or light chain of an antibody of the invention or a portion thereof or
a single chain antibody of the invention), requires construction of an expression vector
containing a polynucleotide that encodes the antibody. Once a polynucleotide encoding
an antibody molecule or a heavy or light chain of an antibody, or portion thereof
(preferably, but not necessarily, containing the heavy or light chain variable domain), of
the invention has been obtained, the vector for the production of the antibody molecule
may be produced by recombinant DNA technology using techniques well known in the
art. Thus, methods for preparing a protein by expressing a polynucleotide containing an
antibody encoding nucleotide sequence are described herein. Methods that are well
known to those skilled in the art can be used to construct expression vectors containing
antibody coding sequences and appropriate transcriptional and translational control
signals. These methods include, for example, in vitro recombinant DNA techniques,
synthetic techniques, and in vivo genetic recombination. The invention, thus, provides
replicable vectors comprising a nucleotide sequence encoding an antibody molecule of
the invention, a heavy or light chain of an antibody, a heavy or light chain variable
domain of an antibody or a portion thereof, or a heavy or light chain CDR, operably
linked to a promoter. Such vectors may include the nucleotide sequence encoding the
constant region of the antibody molecule (see, e.g., PCT Publication WO 86/05807; PCT
Publication WO 89/01036; and U.S. Patent No. 5,122,464) and the variable domain of the
antibody may be cloned into such a vector for expression of the entire heavy, the entire
light chain, or both the entire heavy and light chains.

The expression vector is transferred to a host cell by conventional techniques and
the transfected cells are then cultured by conventional techniques to produce an antibody
of the invention. Thus, the invention includes host cells containing a polynucleotide
encoding an antibody of the invention or fragments thereof, or a heavy or light chain
thereof, or portion thereof, or a single chain antibody of the invention, operably linked to
a heterologous promoter. In preferred embodiments for the expression of double-chained
antibodies, vectors encoding both the heavy and light chains may be co-expressed in the
host cell for expression of the entire immunoglobulin molecule, as detailed below.

A variety of host-expression vector systems may be utilized to express the
antibody molecules of the invention. Such host-expression systems represent vehicles by
which the coding sequences of interest may be produced and subsequently purified, but
also represent cells which may, when transformed or transfected with the appropriate
nucleotide coding sequences, express an antibody molecule of the invention in situ.
These include but are not limited to microorganisms such as bacteria (e.g., E. coli, B.
subtilis) transformed with recombinant bacteriophage DNA, plasmid DNA or cosmid
DNA expression vectors containing antibody coding sequences; yeast (e.g.,
Saccharomyces, Pichia) transformed with recombinant yeast expression vectors
containing antibody coding sequences; insect cell systems infected with recombinant
virus expression vectors (e.g., baculovirus) containing antibody coding sequences; plant
cell systems infected with recombinant virus expression vectors (e.g., cauliflower mosaic
virus, CaMV; tobacco mosaic virus, TMV) or transformed with recombinant plasmid
expression vectors (e.g., Ti plasmid) containing antibody coding sequences; or
mammalian cell systems (e.g., COS, CHO, BHK, 293, 3T3 cells) harboring recombinant
expression constructs containing promoters derived from the genome of mammalian cells
(e.g., metallothionein promoter) or from mammalian viruses (e.g., the adenovirus late
promoter, the vaccinia virus 7.5K promoter). Preferably, bacterial cells such as
Escherichia coli, and more preferably, eukaryotic cells, especially for the expression of
whole recombinant antibody molecule, are used for the expression of a recombinant
antibody molecule. For example, mammalian cells such as Chinese hamster ovary cells
(CHO), in conjunction with a vector such as the major intermediate early gene promoter
element from human cytomegalovirus is an effective expression system for antibodies
(Foecking et al., 1986, Gene 45:101; Cockett et al., 1990, Bio/Technology 8:2).

In bacterial systems, a number of expression vectors may be advantageously
selected depending upon the use intended for the antibody molecule being expressed. For
example, when a large quantity of such a protein is to be produced, for the generation of
pharmaceutical compositions of an antibody molecule, vectors which direct the
expression of high levels of fusion protein products that are readily purified may be
desirable. Such vectors include, but are not limited to, the E. coli expression vector
pUR278 (Ruther et al., 1983, EMBO J. 2:1791), in which the antibody coding sequence
may be ligated individually into the vector in frame with the lac Z coding region so that a
fusion protein is produced; pIN vectors (Inouye & Inouye, 1985, Nucleic Acids Res.
13:3101-3109; Van Heeke & Schuster, 1989, J. Biol. Chem. 24:5503-5509); and the like.
pGEX vectors may also be used to express foreign polypeptides as fusion proteins with
glutathione S-transferase (GST). In general, such fusion proteins are soluble and can
easily be purified from lysed cells by adsorption and binding to matrix glutathione
agarose beads followed by elution in the presence of free glutathione. The pGEX vectors
are designed to include thrombin or factor Xa protease cleavage sites so that the cloned
target gene product can be released from the GST moiety.


In an insect system, Autographa californica nuclear polyhedrosis virus (AcNPV)
is used as a vector to express foreign genes. The virus grows in Spodoptera frugiperda
cells. The antibody coding sequence may be cloned individually into non-essential
regions (for example the polyhedrin gene) of the virus and placed under control of an
AcNPV promoter (for example the polyhedrin promoter).

In mammalian host cells, a number of viral-based expression systems may be
utilized. In cases where an adenovirus is used as an expression vector, the antibody
coding sequence of interest may be ligated to an adenovirus transcription/translation
control complex, e.g., the late promoter and tripartite leader sequence. This chimeric
gene may then be inserted in the adenovirus genome by in vitro or in vivo recombination.
Insertion in a non-essential region of the viral genome (e.g., region E1 or E3) will result
in a recombinant virus that is viable and capable of expressing the antibody molecule in
infected hosts (e.g., see Logan & Shenk, 1984, Proc. Natl. Acad. Sci. USA 81:355-359).
Specific initiation signals may also be required for efficient translation of inserted
antibody coding sequences. These signals include the ATG initiation codon and adjacent
sequences. Furthermore, the initiation codon must be in phase with the reading frame of
the desired coding sequence to ensure translation of the entire insert. These exogenous
translational control signals and initiation codons can be of a variety of origins, both
natural and synthetic. The efficiency of expression may be enhanced by the inclusion of
appropriate transcription enhancer elements, transcription terminators, etc. (see,
Bittner et al., 1987, Methods in Enzymol. 153:51-544).
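
The requirement that the initiation codon be in phase with the reading frame of the coding sequence can be illustrated with a short check. The following is a minimal sketch only, not a method described in this application: the function name, the example construct, and the choice to reject candidates separated from the insert by an in-frame stop codon are all assumptions made for illustration.

```python
# Illustrative sketch (hypothetical helper, not part of the disclosure):
# find upstream ATG codons that are in phase with an inserted coding sequence.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def in_phase_start_codons(sequence: str, coding_start: int) -> list[int]:
    """Return 0-based positions of ATG codons at or upstream of coding_start
    whose reading frame matches the coding sequence and that are not
    separated from it by an in-frame stop codon."""
    seq = sequence.upper()
    hits = []
    for i in range(coding_start + 1):
        # The ATG is in phase only if the distance to the insert is a multiple of 3.
        if seq[i:i + 3] != "ATG" or (coding_start - i) % 3 != 0:
            continue
        codons_between = [seq[j:j + 3] for j in range(i, coding_start, 3)]
        if not any(codon in STOP_CODONS for codon in codons_between):
            hits.append(i)
    return hits


# Hypothetical construct: 2-base leader, ATG, one linker codon, then the insert.
construct = "CC" + "ATG" + "GCA" + "GATCTGAAA"
print(in_phase_start_codons(construct, coding_start=8))

# One extra linker base shifts the frame, so no in-phase ATG remains.
shifted = "CC" + "ATG" + "GGCA" + "GATCTGAAA"
print(in_phase_start_codons(shifted, coding_start=9))
```

In the first construct the ATG lies six bases (two codons) upstream of the insert and is reported; in the second, the single added base puts the ATG out of phase and the list is empty.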

In addition, a host cell strain may be chosen which modulates the expression of
the inserted sequences, or modifies and processes the gene product in the specific fashion
desired. Such modifications (e.g., glycosylation) and processing (e.g., cleavage) of
protein products may be important for the function of the protein. Different host cells
have characteristic and specific mechanisms for the post-translational processing and
modification of proteins and gene products. Appropriate cell lines or host systems can be
chosen to ensure the correct modification and processing of the foreign protein expressed.
To this end, eukaryotic host cells which possess the cellular machinery for proper
processing of the primary transcript, glycosylation, and phosphorylation of the gene
product may be used. Such mammalian host cells include but are not limited to CHO,
VERY, BHK, Hela, COS, MDCK, 293, 3T3, WI38, and in particular, breast cancer cell
lines such as, for example, BT483, Hs578T, HTB2, BT20 and T47D, and normal
mammary gland cell lines such as, for example, CRL7030 and Hs578Bst.


For long-term, high-yield production of recombinant proteins, stable expression is
preferred. For example, cell lines that stably express the antibody molecule may be
engineered. Rather than using expression vectors which contain viral origins of
replication, host cells can be transformed with DNA controlled by appropriate expression
control elements (e.g., promoter, enhancer, sequences, transcription terminators,
polyadenylation sites, etc.), and a selectable marker. Following the introduction of the
foreign DNA, engineered cells may be allowed to grow for 1-2 days in an enriched
media, and then are switched to a selective media. The selectable marker in the
recombinant plasmid confers resistance to the selection and allows cells to stably
integrate the plasmid into their chromosomes and grow to form foci which in turn can be
cloned and expanded into cell lines. This method may advantageously be used to
engineer cell lines which express the antibody molecule. Such engineered cell lines may
be particularly useful in screening and evaluation of compositions that interact directly or
indirectly with the antibody molecule.

A number of selection systems may be used, including but not limited to the
herpes simplex virus thymidine kinase (Wigler et al., 1977, Cell 11:223),
hypoxanthine-guanine phosphoribosyltransferase (Szybalska & Szybalski, 1992, Proc.
Natl. Acad. Sci. USA 48:202), and adenine phosphoribosyltransferase (Lowy et al., 1980,
Cell 22:8-17) genes, which can be employed in tk-, hgprt- or aprt- cells, respectively. Also,
antimetabolite resistance can be used as the basis of selection for the following genes:
dhfr, which confers resistance to methotrexate (Wigler et al., 1980, Proc. Natl. Acad. Sci.
USA 77:357; O'Hare et al., 1981, Proc. Natl. Acad. Sci. USA 78:1527); gpt, which confers
resistance to mycophenolic acid (Mulligan & Berg, 1981, Proc. Natl. Acad. Sci. USA
78:2072); neo, which confers resistance to the aminoglycoside G-418 (Wu and Wu, 1991,
Biotherapy 3:87-95; Tolstoshev, 1993, Ann. Rev. Pharmacol. Toxicol. 32:573-596;
Mulligan, 1993, Science 260:926-932; and Morgan and Anderson, 1993, Ann. Rev.
Biochem. 62:191-217; May, 1993, TIB TECH 11(5):155-215); and hygro, which confers
resistance to hygromycin (Santerre et al., 1984, Gene 30:147). Methods commonly
known in the art of recombinant DNA technology may be routinely applied to select the
desired recombinant clone, and such methods are described, for example, in Ausubel et
al. (eds.), Current Protocols in Molecular Biology, John Wiley & Sons, NY (1993);
Kriegler, Gene Transfer and Expression, A Laboratory Manual, Stockton Press, NY
(1990); and in Chapters 12 and 13, Dracopoli et al. (eds), Current Protocols in Human


Genetics, John Wiley & Sons, NY (1994); Colberre-Garapin et al., 1981, J. Mol. Biol.
150:1, which are incorporated by reference herein in their entireties.

The expression levels of an antibody molecule can be increased by vector
amplification (for a review, see Bebbington and Hentschel, The use of vectors based on
gene amplification for the expression of cloned genes in mammalian cells in DNA
cloning, Vol. 3 (Academic Press, New York, 1987)). When a marker in the vector system
expressing antibody is amplifiable, an increase in the level of inhibitor present in the
culture of host cells will increase the number of copies of the marker gene. Since the
amplified region is associated with the antibody gene, production of the antibody will also
increase. See, for example, Crouse et al., 1983, Mol. Cell. Biol. 3:257.

The host cell may be co-transfected with two expression vectors of the invention,
the first vector encoding a heavy chain derived polypeptide and the second vector
encoding a light chain derived polypeptide. The two vectors may contain identical
selectable markers that enable equal expression of heavy and light chain polypeptides.
Alternatively, a single vector may be used that encodes, and is capable of expressing,
both heavy and light chain polypeptides. In such situations, the light chain should be
placed before the heavy chain to avoid an excess of toxic free heavy chain (Proudfoot,
1986, Nature 322:52; and Kohler, 1980, Proc. Natl. Acad. Sci. USA 77:2197). The
coding sequences for the heavy and light chains may comprise cDNA or genomic DNA.

Once an antibody molecule of the invention has been produced by recombinant
expression, it may be purified by any method known in the art for purification of an
immunoglobulin molecule, for example, by chromatography (e.g., ion exchange, affinity,
particularly by affinity for the specific antigen after Protein A, and sizing column
chromatography), centrifugation, differential solubility, or by any other standard
technique for the purification of proteins. Further, the antibodies of the present invention
or fragments thereof may be fused to heterologous polypeptide sequences described
herein or otherwise known in the art to facilitate purification.

5.18.10. OBESITY RELATED GENE ANTI-SENSE NUCLEIC ACIDS 


The function of the obesity related genes disclosed in Section 6.7.5 may be
inhibited by use of antisense nucleic acids. The present invention provides the
therapeutic or prophylactic use of nucleic acids of at least six nucleotides that are
antisense to a gene or cDNA encoding an obesity related gene product disclosed in


Section 6.7.5, or portions thereof. An "antisense" nucleic acid as used herein refers to a
nucleic acid capable of hybridizing to a portion of a nucleic acid disclosed in Section
6.7.5 (preferably mRNA, e.g., the sequence of SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID
NO: 12, SEQ ID NO: 16 and/or SEQ ID NO: 20) by virtue of some sequence
complementarity. The antisense nucleic acid may be complementary to a coding and/or
noncoding region of an obesity related mRNA.
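
As an illustration of sequence complementarity, a fully complementary single-stranded DNA antisense oligonucleotide is the reverse complement of the targeted mRNA segment. The sketch below is an assumption-laden example, not the applicants' procedure: the function name and the example target are hypothetical, and the six-nucleotide minimum is taken from the text above.

```python
# Illustrative sketch (hypothetical helper): derive a perfectly complementary
# antisense DNA oligonucleotide from a stretch of target mRNA.

_RNA_TO_ANTISENSE_DNA = str.maketrans("ACGU", "TGCA")


def antisense_dna(target_mrna: str, min_length: int = 6) -> str:
    """Return the antisense DNA strand (written 5'->3') for target_mrna (5'->3')."""
    # Accept cDNA-style input by treating T as U.
    target = target_mrna.upper().replace("T", "U")
    if len(target) < min_length:
        raise ValueError(f"target must be at least {min_length} nucleotides")
    if set(target) - set("ACGU"):
        raise ValueError("target contains characters other than A, C, G, U/T")
    # Complement each base, then reverse so the result reads 5'->3'.
    return target.translate(_RNA_TO_ANTISENSE_DNA)[::-1]


print(antisense_dna("AUGGCC"))  # GGCCAT
```

Absolute complementarity is not required by the definition above; this helper simply produces the fully complementary case as a starting point.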

The antisense nucleic acids can be oligonucleotides that are double-stranded or
single-stranded RNA or DNA or a modification or derivative thereof, which can be
directly administered to a cell, or which can be produced intracellularly by transcription
of exogenous, introduced sequences.

The antisense nucleic acids are of at least six nucleotides and are preferably
oligonucleotides (ranging from 6 to about 200 nucleotides). In specific aspects, the
oligonucleotide is at least 10 nucleotides, at least 15 nucleotides, at least 100 nucleotides,
or at least 200 nucleotides. The oligonucleotides can be DNA or RNA or chimeric
mixtures or derivatives or modified versions thereof, single-stranded or double-stranded.
The oligonucleotide can be modified at the base moiety, sugar moiety, or phosphate
backbone. The oligonucleotide may include other appending groups such as peptides, or
agents facilitating transport across the cell membrane (see, e.g., Letsinger et al., 1989,
Proc. Natl. Acad. Sci. U.S.A. 86: 6553-6556; Lemaitre et al., 1987, Proc. Natl. Acad. Sci.
84: 648-652; PCT Publication No. WO 88/09810, published December 15, 1988) or
blood-brain barrier (see, e.g., PCT Publication No. WO 89/10134, published April 25,
1988), hybridization-triggered cleavage agents (see, e.g., Krol et al., 1988, BioTechniques
6: 958-976) or intercalating agents (see, e.g., Zon, 1988, Pharm. Res. 5: 539-549).

In a preferred aspect of the invention, the antisense oligonucleotide is provided,
preferably as single-stranded DNA. The oligonucleotide may be modified at any position
on its structure with constituents generally known in the art.

The antisense oligonucleotides may comprise at least one modified base moiety
that is selected from the group including, but not limited to, 5-fluorouracil, 5-bromouracil,
5-chlorouracil, 5-iodouracil, hypoxanthine, xanthine, 4-acetylcytosine,
5-(carboxyhydroxylmethyl) uracil, 5-carboxymethylaminomethyl-2-thiouridine,
5-carboxymethylaminomethyluracil, dihydrouracil, beta-D-galactosylqueosine, inosine,
N6-isopentenyladenine, 1-methylguanine, 1-methylinosine, 2,2-dimethylguanine,
2-methyladenine, 2-methylguanine, 3-methylcytosine, 5-methylcytosine, N6-adenine,


7-methylguanine, 5-methylaminomethyluracil, 5-methoxyaminomethyl-2-thiouracil,
beta-D-mannosylqueosine, 5'-methoxycarboxymethyluracil, 5-methoxyuracil,
2-methylthio-N6-isopentenyladenine, uracil-5-oxyacetic acid (v), wybutoxosine,
pseudouracil, queosine, 2-thiocytosine, 5-methyl-2-thiouracil, 2-thiouracil, 4-thiouracil,
5-methyluracil, uracil-5-oxyacetic acid methylester, uracil-5-oxyacetic acid (v),
5-methyl-2-thiouracil, 3-(3-amino-3-N-2-carboxypropyl) uracil, and 2,6-diaminopurine.

In another embodiment, the oligonucleotide comprises at least one modified sugar
moiety selected from the group including, but not limited to, arabinose,
2-fluoroarabinose, xylulose, and hexose.

In yet another embodiment, the oligonucleotide comprises at least one modified
phosphate backbone selected from the group consisting of a phosphorothioate, a
phosphorodithioate, a phosphoramidothioate, a phosphoramidate, a phosphordiamidate, a
methylphosphonate, an alkyl phosphotriester, a formacetal, or analogs thereof.

In yet another embodiment, the oligonucleotide is an α-anomeric oligonucleotide.
An α-anomeric oligonucleotide forms specific double-stranded hybrids with
complementary RNA in which, contrary to the usual β-units, the strands run parallel to
each other (Gautier et al., 1987, Nucl. Acids Res. 15: 6625-6641).

The oligonucleotide may be conjugated to another molecule, e.g., a peptide,
hybridization triggered cross-linking agent, transport agent, hybridization-triggered
cleavage agent, etc.

Oligonucleotides may be synthesized by standard methods known in the art, e.g.,
by use of an automated DNA synthesizer (such as are commercially available from
Biosearch, Applied Biosystems, etc.). As examples, phosphorothioate oligonucleotides
may be synthesized by the method of Stein et al. (1988, Nucl. Acids Res. 16: 3209),
methylphosphonate oligonucleotides can be prepared by use of controlled pore glass
polymer supports (Sarin et al., 1988, Proc. Natl. Acad. Sci. U.S.A. 85: 7448-7451), etc.

In a specific embodiment, the antisense oligonucleotides comprise catalytic
RNAs, or ribozymes (see, e.g., PCT International Publication WO 90/11364, published
October 4, 1990; Sarver et al., 1990, Science 247: 1222-1225). In another embodiment,
the oligonucleotide is a 2'-O-methylribonucleotide (Inoue et al., 1987, Nucl. Acids Res.
15: 6131-6148), or a chimeric RNA-DNA analog (Inoue et al., 1987, FEBS Lett. 215:
327-330).


In an alternative embodiment, antisense nucleic acids are produced intracellularly
by transcription from an exogenous sequence. For example, a vector can be introduced in
vivo such that it is taken up by a cell, within which cell the vector or a portion thereof is
transcribed, producing an antisense nucleic acid (RNA) of the invention. Such a vector
would contain a sequence encoding an antisense nucleic acid. Such a vector can remain
episomal or become chromosomally integrated, as long as it can be transcribed to produce
the desired antisense RNA. Such vectors can be constructed by recombinant DNA
technology methods standard in the art. Vectors can be plasmid, viral, or others known in
the art, used for replication and expression in mammalian cells. Expression of the
sequences encoding the antisense RNAs can be by any promoter known in the art to act in
mammalian, preferably human, cells. Such promoters can be inducible or constitutive.
Such promoters include, but are not limited to, the SV40 early promoter region (Bernoist
and Chambon, 1981, Nature 290: 304-310), the promoter contained in the 3' long
terminal repeat of Rous sarcoma virus (Yamamoto et al., 1980, Cell 22: 787-797), the
herpes thymidine kinase promoter (Wagner et al., 1981, Proc. Natl. Acad. Sci. U.S.A. 78:
1441-1445), the regulatory sequences of the metallothionein gene (Brinster et al., 1982,
Nature 296: 39-42), etc.

The antisense nucleic acids of the invention comprise a sequence complementary
to at least a portion of an RNA transcript of a gene disclosed in Section 6.7.5. However,
absolute complementarity, although preferred, is not required. A sequence
"complementary to at least a portion of an RNA," as referred to herein, means a sequence
having sufficient complementarity to be able to hybridize with the RNA, forming a stable
duplex; in the case of double-stranded antisense nucleic acids, a single strand of the
duplex DNA may thus be tested, or triplex formation may be assayed. The ability to
hybridize will depend on both the degree of complementarity and the length of the
antisense nucleic acid.

Generally, the longer the hybridizing nucleic acid, the more base mismatches with
an obesity related RNA (target RNA) it may contain and still form a stable duplex (or
triplex, as the case may be). One skilled in the art can ascertain a tolerable degree of
mismatch by use of standard procedures to determine the melting point of the hybridized
complex.
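
The relationship between length, mismatch, and duplex stability can be explored computationally before any melting point is measured. The following sketch is offered only as a rough aid and is not a procedure from this application: it counts mismatched positions between an antisense DNA oligonucleotide and its RNA target site, and estimates the melting temperature of a perfectly matched short duplex with the Wallace rule, Tm = 2(A+T) + 4(G+C) in degrees Celsius, a common approximation for oligonucleotides of roughly 14 nucleotides or fewer. The function names are hypothetical, and the estimate does not account for mismatches, salt, or oligonucleotide concentration; an experimentally determined melting point remains the reference.

```python
# Illustrative sketch (hypothetical helpers, rough approximation only).

# Base on the antisense strand -> base it pairs with on the RNA target.
_PAIRS_WITH = {"A": "U", "T": "A", "U": "A", "G": "C", "C": "G"}


def count_mismatches(antisense: str, target_rna: str) -> int:
    """Count non-Watson-Crick positions between an antisense strand (5'->3')
    and an equal-length RNA target site (5'->3'), paired antiparallel."""
    oligo = antisense.upper()
    target = target_rna.upper().replace("T", "U")
    if len(oligo) != len(target):
        raise ValueError("antisense and target site must have equal length")
    # Antiparallel pairing: first antisense base faces the last target base.
    return sum(1 for a, t in zip(oligo, reversed(target)) if _PAIRS_WITH.get(a) != t)


def wallace_tm_celsius(oligo_dna: str) -> int:
    """Wallace-rule Tm estimate for a short, perfectly matched DNA oligonucleotide."""
    seq = oligo_dna.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


print(count_mismatches("GGCCAT", "AUGGCC"))  # 0
print(count_mismatches("GGCCAA", "AUGGCC"))  # 1
print(wallace_tm_celsius("GGCCAT"))          # 20
```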




Pharmaceutical compositions of the invention, comprising an effective amount of
an antisense nucleic acid in a pharmaceutically acceptable carrier, can be administered in
therapeutic methods of the invention.

The amount of antisense nucleic acid that will be effective in the treatment of a
particular disorder or condition will depend on the nature of the disorder or condition, and
can be determined by standard clinical techniques. Where possible, it is desirable to
determine the antisense cytotoxicity in vitro, and then in useful animal model systems
prior to testing and use in humans.

In a specific embodiment, pharmaceutical compositions comprising antisense
nucleic acids are administered via liposomes, microparticles, or microcapsules. In various
embodiments of the invention, it may be useful to use such compositions to achieve
sustained release of antisense nucleic acids. In a specific embodiment, it may be
desirable to utilize liposomes targeted via antibodies to specific identifiable central
nervous system cell types (Leonetti et al., 1990, Proc. Natl. Acad. Sci. U.S.A. 87:
2448-2451; Renneisen et al., 1990, J. Biol. Chem. 265: 16337-16342).

5.18.11. OBESITY RELATED GENE PRODUCT ANALOGS, DERIVATIVES 

AND FRAGMENTS 

The invention further provides methods of modulating the obesity related genes
disclosed in Section 6.7.5 using agonists and promoters of such genes. Agonists include,
but are not limited to, active fragments thereof (wherein a fragment is at least a 10, 15, 20,
30, 50, 75, 100, or 150 amino acid portion of an obesity related gene product disclosed in
Section 6.7.5) and analogs and derivatives thereof, and nucleic acids encoding any of the
foregoing.

For recombinant expression of obesity related gene products, and fragments,
derivatives and analogs thereof, the nucleic acid containing all or a portion of the
nucleotide sequence encoding the protein can be inserted into an appropriate expression
vector, e.g., a vector that contains the necessary elements for the transcription and
translation of the inserted protein coding sequence. In a preferred embodiment, the
regulatory elements (e.g., promoter) are heterologous (i.e., not the native gene promoter).
Promoters which may be used include but are not limited to the SV40 early promoter
(Bernoist and Chambon, 1981, Nature 290: 304-310), the promoter contained in the 3'
long terminal repeat of Rous sarcoma virus (Yamamoto et al., 1980, Cell 22: 787-797),


the herpes thymidine kinase promoter (Wagner et al., 1981, Proc. Natl. Acad. Sci. USA
78: 1441-1445), the regulatory sequences of the metallothionein gene (Brinster et al.,
1982, Nature 296: 39-42); prokaryotic expression vectors such as the β-lactamase
promoter (Villa-Kamaroff et al., 1978, Proc. Natl. Acad. Sci. USA 75: 3727-3731) or the
tac promoter (DeBoer et al., 1983, Proc. Natl. Acad. Sci. USA 80: 21-25; see also "Useful
Proteins from Recombinant Bacteria": in Scientific American 1980, 242:79-94); plant
expression vectors comprising the nopaline synthetase promoter (Herrera-Estrella et al.,
1984, Nature 303: 209-213) or the cauliflower mosaic virus 35S RNA promoter (Gardner
et al., 1981, Nucleic Acids Res. 9:2871), and the promoter of the photosynthetic enzyme
ribulose bisphosphate carboxylase (Herrera-Estrella et al., 1984, Nature 310: 115-120);
promoter elements from yeast and other fungi such as the Gal4 promoter, the alcohol
dehydrogenase promoter, the phosphoglycerol kinase promoter, the alkaline phosphatase
promoter, and the following animal transcriptional control regions that exhibit tissue
specificity and have been utilized in transgenic animals: elastase I gene control region
which is active in pancreatic acinar cells (Swift et al., 1984, Cell 38: 639-646; Ornitz et
al., 1986, Cold Spring Harbor Symp. Quant. Biol. 50: 399-409; MacDonald 1987,
Hepatology 7: 425-515); insulin gene control region which is active in pancreatic beta
cells (Hanahan et al., 1985, Nature 315: 115-122), immunoglobulin gene control region
which is active in lymphoid cells (Grosschedl et al., 1984, Cell 38: 647-658; Adams et
al., 1985, Nature 318: 533-538; Alexander et al., 1987, Mol. Cell Biol. 7: 1436-1444),
mouse mammary tumor virus control region which is active in testicular, breast, lymphoid
and mast cells (Leder et al., 1986, Cell 45: 485-495), albumin gene control region which
is active in liver (Pinkert et al., 1987, Genes and Devel. 1: 268-276), alpha-fetoprotein
gene control region which is active in liver (Krumlauf et al., 1985, Mol. Cell. Biol. 5:
1639-1648; Hammer et al., 1987, Science 235: 53-58), alpha-1 antitrypsin gene control
region which is active in liver (Kelsey et al., 1987, Genes and Devel. 1: 161-171), beta
globin gene control region which is active in myeloid cells (Mogram et al., 1985, Nature
315: 338-340; Kollias et al., 1986, Cell 46: 89-94), myelin basic protein gene control
region which is active in oligodendrocyte cells of the brain (Readhead et al., 1987, Cell
48: 703-712), myosin light chain-2 gene control region which is active in skeletal muscle
(Sani 1985, Nature 314: 283-286), and gonadotrophic releasing hormone gene control
region which is active in gonadotrophs of the hypothalamus (Mason et al., 1986, Science
234: 1372-1378).




A variety of host-vector systems may be utilized to express the protein coding
sequence. These include, but are not limited to, mammalian cell systems infected with
virus (e.g., vaccinia virus, adenovirus, etc.); insect cell systems infected with virus (e.g.,
baculovirus); microorganisms such as yeast containing yeast vectors; or bacteria
transformed with bacteriophage DNA, plasmid DNA, or cosmid DNA. The expression
elements of vectors vary in their strengths and specificities. Depending on the host-vector
system utilized, any one of a number of suitable transcription and translation elements
may be used.

Once an obesity related gene product disclosed in Section 6.7.5, or fragment,
derivative or analog thereof, has been recombinantly expressed, it may be isolated and
purified by standard methods including chromatography (e.g., ion exchange, affinity, and
sizing column chromatography), centrifugation, differential solubility, or by any other
standard technique for the purification of proteins. An obesity related gene product may
also be purified by any standard purification method from natural sources.

Alternatively, an obesity related gene product, analog or derivative thereof of the
present invention can be synthesized by standard chemical methods known in the art (e.g.,
see Hunkapiller et al., 1984, Nature 310:105-111).


Standard techniques known to those of skill in the art can be used to introduce
mutations in the nucleotide sequence encoding a molecule of the invention, including, for
example, site-directed mutagenesis and PCR-mediated mutagenesis that results in amino
acid substitutions. Preferably, the derivatives include less than 25 amino acid
substitutions, less than 20 amino acid substitutions, less than 15 amino acid substitutions,
less than 10 amino acid substitutions, less than 5 amino acid substitutions, less than 4
amino acid substitutions, less than 3 amino acid substitutions, or less than 2 amino acid
substitutions relative to the original molecule. In a preferred embodiment, the derivatives
have conservative amino acid substitutions made at one or more predicted
non-essential amino acid residues. A "conservative amino acid substitution" is one in
which the amino acid residue is replaced with an amino acid residue having a side chain
with a similar charge. Families of amino acid residues having side chains with similar
charges have been defined in the art. These families include amino acids with basic side
chains (e.g., lysine, arginine, histidine), acidic side chains (e.g., aspartic acid, glutamic
acid), uncharged polar side chains (e.g., glycine, asparagine, glutamine, serine, threonine,
tyrosine, cysteine), nonpolar side chains (e.g., alanine, valine, leucine, isoleucine, proline,




phenylalanine, methionine, tryptophan), beta-branched side chains (e.g., threonine,
valine, isoleucine) and aromatic side chains (e.g., tyrosine, phenylalanine, tryptophan,
histidine). Alternatively, mutations can be introduced randomly along all or part of the
coding sequence, such as by saturation mutagenesis, and the resultant mutants can be
screened for biological activity to identify mutants that retain activity. Following
mutagenesis, the encoded protein can be expressed and the activity of the protein can be
determined.
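
The side-chain families listed above lend themselves to a simple lookup. The sketch below is illustrative only: the family membership is transcribed from the preceding paragraph using one-letter amino acid codes, while the function names and the rule that a substitution is treated as conservative when both residues share at least one listed family are assumptions made for the example.

```python
# Illustrative sketch (hypothetical helpers); families transcribed from the text.

FAMILIES = {
    "basic": set("KRH"),                # lysine, arginine, histidine
    "acidic": set("DE"),                # aspartic acid, glutamic acid
    "uncharged polar": set("GNQSTYC"),  # Gly, Asn, Gln, Ser, Thr, Tyr, Cys
    "nonpolar": set("AVLIPFMW"),        # Ala, Val, Leu, Ile, Pro, Phe, Met, Trp
    "beta-branched": set("TVI"),        # threonine, valine, isoleucine
    "aromatic": set("YFWH"),            # Tyr, Phe, Trp, His
}


def shared_families(a: str, b: str) -> list[str]:
    """Names of the listed families that contain both residues."""
    a, b = a.upper(), b.upper()
    return [name for name, members in FAMILIES.items() if a in members and b in members]


def is_conservative(original: str, replacement: str) -> bool:
    """True if the two residues differ and share at least one listed family."""
    return original.upper() != replacement.upper() and bool(
        shared_families(original, replacement)
    )


def count_substitutions(original: str, derivative: str) -> int:
    """Number of positions that differ between two equal-length sequences."""
    if len(original) != len(derivative):
        raise ValueError("sequences must have equal length")
    return sum(1 for x, y in zip(original, derivative) if x != y)


print(is_conservative("K", "R"))              # True  (both basic)
print(is_conservative("D", "K"))              # False (acidic vs. basic)
print(count_substitutions("MKTAY", "MRTAF"))  # 2
```

A derivative could then be checked against the substitution limits recited above (for example, fewer than 5 substitutions) by comparing the count to the chosen limit.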

In a specific embodiment, the obesity related gene analog, derivative or fragment
thereof is encoded by a nucleotide sequence that hybridizes to the nucleotide sequence of
SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 12, SEQ ID NO: 16, or SEQ ID NO: 20
under stringent conditions, e.g., hybridization to filter-bound DNA in 6x sodium
chloride/sodium citrate (SSC) at about 45 °C followed by one or more washes in
0.2xSSC/0.1% SDS at about 50-65 °C, under highly stringent conditions, e.g.,
hybridization to filter-bound nucleic acid in 6xSSC at about 45 °C followed by one or
more washes in 0.1xSSC/0.2% SDS at about 68 °C, or under other stringent hybridization
conditions that are known to those of skill in the art (see, for example, Ausubel, F.M. et
al., eds., 1989, Current Protocols in Molecular Biology, Vol. I, Green Publishing
Associates, Inc. and John Wiley & Sons, Inc., New York at pages 6.3.1-6.3.6 and 2.10.3).


In another embodiment, the analog, derivative or fragment comprises an amino
acid sequence that is at least 35%, at least 40%, at least 45%, at least 50%, at least 55%,
at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least
90%, at least 95%, or at least 99% identical to the amino acid sequence of SEQ ID NO: 8,
SEQ ID NO: 11, SEQ ID NO: 15, SEQ ID NO: 18, SEQ ID NO: 25, SEQ ID NO: 26, or
SEQ ID NO: 27.
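
Percent identity is commonly computed over a pairwise alignment as the number of identical aligned positions divided by the alignment length. This passage does not specify an alignment algorithm or a denominator convention, so the sketch below is only one plausible reading: it assumes the two sequences have already been aligned (gaps written as "-"), counts every alignment column in the denominator, and uses hypothetical function names. The threshold list is taken from the paragraph above.

```python
# Illustrative sketch (hypothetical helpers; alignment is assumed to be given).

THRESHOLDS = [35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 99]


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns holding the same residue in both sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore columns that are gaps in both sequences.
    columns = [(x, y) for x, y in zip(aligned_a.upper(), aligned_b.upper())
               if not (x == "-" and y == "-")]
    if not columns:
        raise ValueError("alignment has no comparable columns")
    matches = sum(1 for x, y in columns if x == y and x != "-")
    return 100.0 * matches / len(columns)


def highest_threshold_met(identity: float):
    """Largest recited threshold that the identity value reaches, or None."""
    met = [t for t in THRESHOLDS if identity >= t]
    return max(met) if met else None


identity = percent_identity("ACDE", "ACDF")
print(identity)                         # 75.0
print(highest_threshold_met(identity))  # 75
```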

Additionally, the nucleic acid sequence can be mutated in vitro or in vivo, to
create and/or destroy translation, initiation, and/or termination sequences, or to create
variations in coding regions and/or form new restriction endonuclease sites or destroy
preexisting ones, to facilitate further in vitro modification. Any technique for
mutagenesis known in the art can be used, including, but not limited to, chemical
mutagenesis, in vitro site-directed mutagenesis (Hutchinson, C., et al., 1978, J. Biol.
Chem. 253:6551), use of TAB® linkers (Pharmacia), etc.

Manipulations of the sequence may also be made at the protein level. Included
within the scope of the invention are protein fragments or other derivatives or analogs that


are differentially modified during or after translation, e.g., by glycosylation, acetylation,
phosphorylation, amidation, derivatization by known protecting/blocking groups,
proteolytic cleavage, linkage to an antibody molecule or other cellular ligand, etc. Any of
numerous chemical modifications may be carried out by known techniques including, but
not limited to, specific chemical cleavage by cyanogen bromide, trypsin, chymotrypsin,
papain, V8 protease, NaBH4, acetylation, formylation, oxidation, reduction; metabolic
synthesis in the presence of tunicamycin, etc.

In addition, analogs and derivatives of an obesity related gene product can be
chemically synthesized. Furthermore, if desired, nonclassical amino acids or chemical
amino acid analogs can be introduced as a substitution or addition into the sequence.
Non-classical amino acids include but are not limited to the D-isomers of the common
amino acids, α-amino isobutyric acid, 4-aminobutyric acid, Abu, 2-amino butyric acid,
γ-Abu, ε-Ahx, 6-amino hexanoic acid, Aib, 2-amino isobutyric acid, 3-amino propionic
acid, ornithine, norleucine, norvaline, hydroxyproline, sarcosine, citrulline, cysteic acid,
t-butylglycine, t-butylalanine, phenylglycine, cyclohexylalanine, β-alanine, fluoro-amino
acids, designer amino acids such as β-methyl amino acids, Cα-methyl amino acids,
Nα-methyl amino acids, and amino acid analogs in general. Furthermore, the amino acids
used to make the analogs and derivatives can be D (dextrorotary), L (levorotary), or some
combination of D and L.

In a specific embodiment, the derivative is a chimeric, or fusion, protein
comprising an obesity related gene product disclosed in Section 6.7.5 or fragment thereof
(preferably consisting of at least one protein domain or protein structural motif, or at least
15, preferably 20, amino acids of the obesity related protein) joined at its amino- or
carboxy-terminus via a peptide bond to an amino acid sequence of a different protein. In
one embodiment, such a chimeric protein is produced by recombinant expression of a
nucleic acid encoding the protein (comprising an obesity related protein-coding sequence
joined in-frame to a coding sequence for a different protein). Such a chimeric product
can be made by ligating the appropriate nucleic acid sequences encoding the desired
amino acid sequences to each other by methods known in the art, in the proper coding
frame, and expressing the chimeric product by methods commonly known in the art.
Alternatively, such a chimeric product may be made by protein synthetic techniques, e.g.,
by use of a peptide synthesizer. Chimeric genes comprising portions of obesity related
gene product (e.g., SEQ ID NO: 8, SEQ ID NO: 11, SEQ ID NO: 15, SEQ ID NO: 18,


SEQ ID NO: 25, SEQ ID NO: 26, or SEQ ID NO: 27) fused to any heterologous protein- 
encoding sequences may be constructed. 

5.18.12. PHARMACEUTICAL COMPOSITIONS AND METHODS OF 
ADMINISTRATION

The invention provides methods of treatment, prophylaxis, and amelioration of
one or more symptoms associated with obesity by administering to a subject an
effective amount of an obesity related gene (e.g., SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID
NO: 12, SEQ ID NO: 16, SEQ ID NO: 20) modulator, or pharmaceutical composition
comprising an obesity related gene modulator. In a preferred aspect, the obesity related
gene modulator is substantially purified (e.g., substantially free from substances that limit
its effect or produce undesired side-effects). The subject is preferably a mammal such as a
non-primate (e.g., cows, pigs, horses, cats, dogs, rats etc.) or a primate (e.g., monkeys or
humans). In a preferred embodiment, the subject is a human.


5.18.12.1. DELIVERY SYSTEMS 
Various delivery systems are known and can be used to administer modulators of
the invention or fragment thereof, e.g., encapsulation in liposomes, microparticles,
microcapsules, recombinant cells capable of expressing a protein or antibody modulator,
receptor-mediated endocytosis (see, e.g., Wu and Wu, 1987, J. Biol. Chem.
262:4429-4432), construction of a nucleic acid as part of a retroviral or other vector, etc.
Methods of administering a modulator, or pharmaceutical composition include, but are
not limited to, parenteral administration (e.g., intradermal, intramuscular, intraperitoneal,
intravenous and subcutaneous), epidural, and mucosal (e.g., intranasal and oral routes).
In a specific embodiment, modulators of the present invention or fragments thereof or
pharmaceutical compositions are administered intramuscularly, intravenously, or
subcutaneously. The compositions may be administered by any convenient route, for
example by infusion or bolus injection, by absorption through epithelial or mucocutaneous
linings (e.g., oral mucosa, rectal and intestinal mucosa, etc.) and may be administered
together with other biologically active agents. Administration can be systemic or local.
In addition, pulmonary administration can also be employed, e.g., by use of an inhaler or
nebulizer, and formulation with an aerosolizing agent. See, e.g., U.S. Patent Nos.
6,019,968, 5,985,309, 5,934,272, 5,874,064, 5,290,540, and 4,880,078, and PCT
Publication No.




WO 92/19244. In a preferred embodiment, the pharmaceutical composition is delivered 
locally to the site of neural tissue damage, e.g., using osmotic or other types of pumps. 

5.18.12.2. PHARMACEUTICAL COMPOSITIONS 

The invention also provides that the pharmaceutical composition is packaged in a
hermetically sealed container such as an ampoule or sachette indicating the quantity of
modulator. In one embodiment, the modulator is supplied as a dry sterilized lyophilized
powder or water free concentrate in a hermetically sealed container and can be
reconstituted, e.g., with water or saline to the appropriate concentration for administration
to a subject. Preferably, the modulator is supplied as a dry sterile lyophilized powder in a
hermetically sealed container at a unit dosage of at least 5 mg, more preferably at least 10
mg, at least 15 mg, at least 25 mg, at least 35 mg, at least 45 mg, at least 50 mg, or at least
75 mg. Preferably, the liquid form is supplied in a hermetically sealed container at a
concentration of at least 1 mg/ml, more preferably at least 2.5 mg/ml, at least 5 mg/ml,
at least 8 mg/ml, at least 10 mg/ml, or at least 25 mg/ml.
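
By way of illustration only, the relationship between a lyophilized unit dosage and the concentration of the reconstituted liquid can be sketched as simple arithmetic. The function name and the example values below are assumptions chosen for illustration, and the sketch ignores any displacement volume contributed by the powder.

```python
def reconstitution_volume_ml(unit_dose_mg: float, target_mg_per_ml: float) -> float:
    """Diluent volume (ml) that brings a lyophilized unit dose to a target concentration.

    Simplifying assumption: the powder itself adds no volume.
    """
    if unit_dose_mg <= 0 or target_mg_per_ml <= 0:
        raise ValueError("dose and concentration must be positive")
    return unit_dose_mg / target_mg_per_ml


# Hypothetical example: a 50 mg unit dose reconstituted to 25 mg/ml
print(reconstitution_volume_ml(50, 25))  # 2.0
```

In this sketch, a 50 mg container reconstituted with 2 ml of water or saline yields 25 mg/ml, which falls within the ranges recited above.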

In a specific embodiment, it may be desirable to administer the pharmaceutical
compositions of the invention locally to the area in need of treatment; this may be
achieved by, for example, and not by way of limitation, local infusion, by injection, or by
means of an implant, said implant being of a porous, non-porous, or gelatinous material,
including membranes, such as sialastic membranes, or fibers. A particularly useful
application involves coating, imbedding or derivatizing fibers, such as collagen fibers,
protein polymers, etc. with a modulator of the invention. Other useful approaches are
described in Otto et al., 1989, J Neuroscience Research 22, 83-91 and Otto and Unsicker,
1990, J Neuroscience 10, 1912-1921, both of which are incorporated herein in their
entireties. Preferably, when administering the modulator, care must be taken to use
materials to which the modulator does not absorb.

In another embodiment, the composition can be delivered in a vesicle, in
particular a liposome (see Langer, 1990, Science 249:1527-1533; Treat et al., 1989,
in Liposomes in the Therapy of Infectious Disease and Cancer, Lopez-Berestein and
Fidler (eds.), Liss, New York, pp. 353-365; and Lopez-Berestein, ibid., pp. 317-327; see
generally ibid.).

In yet another embodiment, the composition can be delivered in a controlled
release system. In one embodiment, a pump may be used (see Langer, supra; Sefton,




1987, CRC Crit. Ref. Biomed. Eng. 14:20; Buchwald et al., 1980, Surgery 88:507;
Saudek et al., 1989, N. Engl. J. Med. 321:574). In another embodiment, polymeric
materials can be used (see, e.g., Medical Applications of Controlled Release, Langer and
Wise (eds.), CRC Pres., Boca Raton, Florida (1974); Controlled Drug Bioavailability,
Drug Product Design and Performance, Smolen and Ball (eds.), Wiley, New York
(1984); Ranger and Peppas, 1983, J. Macromol. Sci. Rev. Macromol. Chem. 23:61; see
also Levy et al., 1985, Science 228:190; During et al., 1989, Ann. Neurol. 25:351;
Howard et al., 1989, J. Neurosurg. 71:105); U.S. Patent No. 5,679,377; U.S. Patent No.
5,916,597; U.S. Patent No. 5,912,015; U.S. Patent No. 5,989,463; U.S. Patent No.
5,128,326; PCT Publication No. WO 99/15154; and PCT Publication No. WO 99/20253.
In yet another embodiment, a controlled release system can be placed in proximity of the
therapeutic target, i.e., nervous tissue (see, e.g., Goodson, 1984, in Medical Applications
of Controlled Release, supra, vol. 2, pp. 115-138). Other controlled release systems are
discussed in the review by Langer, 1990, Science 249:1527-1533.

In a specific embodiment, where the composition of the invention is a nucleic acid
encoding a modulator, the nucleic acid can be administered in vivo to promote expression
of its encoded modulator by constructing it as part of an appropriate nucleic acid
expression vector and administering it so that it becomes intracellular, e.g., by use of a
retroviral vector (see U.S. Patent No. 4,980,286), or by direct injection, or by use of
microparticle bombardment (e.g., a gene gun; Biolistic, Dupont), or coating with lipids or
cell-surface receptors or transfecting agents, or by administering it in linkage to a
homeobox-like peptide which is known to enter the nucleus (see, e.g., Joliot et al., 1991,
Proc. Natl. Acad. Sci. USA 88:1864-1868), etc. Alternatively, a nucleic acid can be
introduced intracellularly and incorporated within host cell DNA for expression by
homologous recombination.

The pharmaceutical compositions of the invention comprise a prophylactically or
therapeutically effective amount of an obesity related gene modulator, and a
pharmaceutically acceptable carrier. In a specific embodiment, the term
"pharmaceutically acceptable" means approved by a regulatory agency of the Federal or a
state government or listed in the U.S. Pharmacopeia or other generally recognized
pharmacopeia for use in animals, and more particularly in humans. The term "carrier"
refers to a diluent, adjuvant (e.g., Freund's adjuvant (complete and incomplete)),
excipient, or vehicle with which the therapeutic is administered. Such pharmaceutical
carriers can be sterile liquids, such as water and oils, including those of petroleum,


animal, vegetable or synthetic origin, such as peanut oil, soybean oil, mineral oil, sesame
oil and the like. Water is a preferred carrier when the pharmaceutical composition is
administered intravenously. Saline solutions and aqueous dextrose and glycerol solutions
can also be employed as liquid carriers, particularly for injectable solutions. Suitable
pharmaceutical excipients include starch, glucose, lactose, sucrose, gelatin, malt, rice,
flour, chalk, silica gel, sodium stearate, glycerol monostearate, talc, sodium chloride,
dried skim milk, glycerol, propylene, glycol, water, ethanol and the like. The
composition, if desired, can also contain minor amounts of wetting or emulsifying agents,
or pH buffering agents. These compositions can take the form of solutions, suspensions,
emulsions, tablets, pills, capsules, powders, sustained-release formulations and the like.
Oral formulations can include standard carriers such as pharmaceutical grades of mannitol,
lactose, starch, magnesium stearate, sodium saccharine, cellulose, magnesium carbonate,
etc. Examples of suitable pharmaceutical carriers are described in "Remington's
Pharmaceutical Sciences" by E. W. Martin. Such compositions will contain a
prophylactically or therapeutically effective amount of the antibody or fragment thereof,
preferably in purified form, together with a suitable amount of carrier so as to provide the
form for proper administration to the patient. The formulation should suit the mode of
administration.

In a preferred embodiment, the composition is formulated in accordance with
routine procedures as a pharmaceutical composition adapted for intravenous
administration to human beings. Typically, compositions for intravenous administration
are solutions in sterile isotonic aqueous buffer. Where necessary, the composition may
also include a solubilizing agent and a local anesthetic such as lignocaine to ease pain at
the site of the injection.

Generally, the ingredients of compositions of the invention are supplied either
separately or mixed together in unit dosage form, for example, as a dry lyophilized
powder or water free concentrate in a hermetically sealed container such as an ampoule or
sachette indicating the quantity of active agent. Where the composition is to be
administered by infusion, it can be dispensed with an infusion bottle containing sterile
pharmaceutical grade water or saline. Where the composition is administered by
injection, an ampoule of sterile water for injection or saline can be provided so that the
ingredients may be mixed prior to administration.


The compositions of the invention can be formulated as neutral or salt forms.
Pharmaceutically acceptable salts include those formed with anions such as those derived
from hydrochloric, phosphoric, acetic, oxalic, tartaric acids, etc., and those formed with
cations such as those derived from sodium, potassium, ammonium, calcium, ferric
hydroxides, isopropylamine, triethylamine, 2-ethylamino ethanol, histidine, procaine, etc.

The amount of the composition delivered is that amount that will be effective in 
the methods of treatment of the invention. 

Some embodiments of the present invention provide a pharmaceutical
composition comprising a therapeutically effective amount of SEQ ID NO: 8, and a
pharmaceutically acceptable carrier. Some embodiments of the present invention provide
a pharmaceutical composition comprising a therapeutically effective amount of an
antibody that binds to SEQ ID NO: 8, and a pharmaceutically acceptable carrier. Some
embodiments of the present invention provide a pharmaceutical composition comprising a
therapeutically effective amount of a fragment or derivative of an antibody that binds to
SEQ ID NO: 8, wherein the fragment or derivative contains the binding domain of the
antibody, and a pharmaceutically acceptable carrier.


Some embodiments of the present invention provide a pharmaceutical
composition comprising a therapeutically effective amount of any one of the following
proteins, and a pharmaceutically acceptable carrier:

a) a purified protein comprising the amino acid sequence of SEQ ID NO: 8;

b) a purified protein encoded by a nucleic acid hybridizable to a DNA having a
sequence consisting of the coding region of SEQ ID NO: 2;

c) a purified protein comprising an amino acid sequence that has at least 60%
identity to the amino acid sequence set forth in SEQ ID NO: 8, in which percentage
identity is determined over an amino acid sequence of identical size as SEQ ID NO: 8; or

d) a purified protein comprising an amino acid sequence that has at least 90%
identity to the amino acid sequence set forth in SEQ ID NO: 8, in which percentage
identity is determined over an amino acid sequence of identical size as SEQ ID NO: 8.
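
As a minimal sketch of how percentage identity "over an amino acid sequence of identical size" might be computed, the following compares two sequences of equal length position by position. The function name and the toy peptide strings are hypothetical and are not SEQ ID NO: 8; the sketch also assumes the sequences are already aligned without gaps, which the text above does not specify.

```python
def percent_identity(candidate: str, reference: str) -> float:
    """Percentage of positions at which two equal-length sequences carry the same residue."""
    if len(candidate) != len(reference):
        raise ValueError("sequences must be of identical size")
    if not reference:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(candidate.upper(), reference.upper()))
    return 100.0 * matches / len(reference)


# Hypothetical 10-residue peptides differing at two positions
reference = "MKTAYIAKQR"
candidate = "MKTAYLAKQK"
print(percent_identity(candidate, reference))  # 80.0
```

Under this sketch, the hypothetical candidate would satisfy the 60% threshold of item (c) but not the 90% threshold of item (d).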

Some embodiments of the present invention provide a pharmaceutical
composition comprising a therapeutically effective amount of one of the following
nucleic acids and a pharmaceutically acceptable carrier:




a) an isolated nucleic acid comprising the nucleotide sequence of SEQ ID NO: 2,
a coding region of SEQ ID NO: 2, SEQ ID NO: 3, a coding region of SEQ ID NO: 3, or
the complement of any of the foregoing;

b) an isolated nucleic acid selected from (a) that is a DNA; or an isolated
nucleic acid comprising a nucleotide sequence encoding any one of the following proteins
or the complement thereof:

i) a purified protein comprising the amino acid sequence of SEQ ID NO: 8;

ii) a purified protein encoded by a nucleic acid hybridizable to a DNA having a
sequence consisting of the coding region of SEQ ID NO: 2;

iii) a purified protein comprising an amino acid sequence that has at least 60%
identity to the amino acid sequence set forth in SEQ ID NO: 8, in which percentage
identity is determined over an amino acid sequence of identical size as SEQ ID NO: 8; or

iv) a purified protein comprising an amino acid sequence that has at least 90%
identity to the amino acid sequence set forth in SEQ ID NO: 8, in which percentage
identity is determined over an amino acid sequence of identical size as SEQ ID NO: 8.

Some embodiments of the present invention provide a pharmaceutical
composition comprising a therapeutically effective amount of any one of the following
recombinant cells and a pharmaceutically acceptable carrier:

a) a recombinant cell containing an isolated nucleic acid comprising the
nucleotide sequence of SEQ ID NO: 2, a coding region of SEQ ID NO: 2, SEQ ID NO: 3,
a coding region of SEQ ID NO: 3, or the complement of any of the foregoing, in which
the nucleotide sequence is under the control of a promoter heterologous to the nucleotide
sequence; or

b) a recombinant cell containing a nucleic acid vector that comprises the
nucleotide sequence of SEQ ID NO: 2, a coding region of SEQ ID NO: 2, SEQ ID NO: 3,
a coding region of SEQ ID NO: 3, or the complement of any of the foregoing.

Some embodiments of the present invention provide a pharmaceutical
composition comprising a therapeutically effective amount of an antibody that binds to a
protein comprising the amino acid sequence of any one of the following, and a
pharmaceutically acceptable carrier:

i) a purified protein comprising the amino acid sequence of SEQ ID NO: 8;


ii) a purified protein encoded by a nucleic acid hybridizable to a DNA having a 
sequence consisting of the coding region of SEQ ID NO: 2; 

iii) a purified protein comprising an amino acid sequence that has at least 60% 
identity to the amino acid sequence set forth in SEQ ID NO: 8, in which percentage 

identity is determined over an amino acid sequence of identical size as SEQ ID NO: 8;
and 

iv) a purified protein comprising an amino acid sequence that has at least 90% 
identity to the amino acid sequence set forth in SEQ ID NO: 8, in which percentage 
identity is determined over an amino acid sequence of identical size as SEQ ID NO: 8. 


5.18.12.3. GENE THERAPY 
In some embodiments, the compositions are delivered by gene therapy. Gene
therapy refers to therapy performed by the administration to a subject of an expressed or
expressible nucleic acid. In this embodiment of the invention, the nucleic acids produce
their encoded modulator that mediates a therapeutic effect. Any of the methods for gene
therapy available in the art can be used according to the present invention. Exemplary
methods are described below.

For general reviews of the methods of gene therapy, see Goldspiel et al., 1993,
Clinical Pharmacy 12:488-505; Wu and Wu, 1991, Biotherapy 3:87-95; Tolstoshev, 1993,
Ann. Rev. Pharmacol. Toxicol. 32:573-596; Mulligan, 1993, Science 260:926-932;
Morgan and Anderson, 1993, Ann. Rev. Biochem. 62:191-217; and May, 1993, TIBTECH
11(5):155-215. Methods commonly known in the art of recombinant DNA technology
which can be used are described in Ausubel et al. (eds.), Current Protocols in Molecular
Biology, John Wiley & Sons, NY (1993); and Kriegler, Gene Transfer and Expression, A
Laboratory Manual, Stockton Press, NY (1990).

In a preferred aspect, a composition of the invention comprises nucleic acids
encoding a modulator. These nucleic acids are part of an expression vector that expresses
the modulator in a suitable host. In particular, such nucleic acids have promoters,
preferably heterologous promoters, operably linked to the antibody coding region, said
promoter being inducible or constitutive, and, optionally, tissue-specific. In another
particular embodiment, nucleic acid molecules are used in which the modulator coding
sequences and any other desired sequences are flanked by regions that promote




homologous recombination at a desired site in the genome, thus providing for
intrachromosomal expression of the modulator encoding nucleic acids (Koller and
Smithies, 1989, Proc. Natl. Acad. Sci. USA 86:8932-8935; Zijlstra et al., 1989, Nature
342:435-438). In specific embodiments, where the modulator is an antibody, the
expressed antibody molecule is a single chain antibody. Alternatively, the nucleic acid
sequences include sequences encoding both the heavy and light chains, or fragments
thereof, of the antibody.

Delivery of the nucleic acids into a subject may be either direct, in which case the
subject is directly exposed to the nucleic acid or nucleic acid-carrying vectors, or indirect,
in which case cells are first transformed with the nucleic acids in vitro, then transplanted
into the subject. These two approaches are known, respectively, as in vivo or ex vivo gene
therapy.

In a specific embodiment, the nucleic acid sequences are directly administered in
vivo, where they are expressed to produce the encoded product. This can be accomplished by
any of numerous methods known in the art, e.g., by constructing them as part of an
appropriate nucleic acid expression vector and administering it so that they become
intracellular, e.g., by infection using defective or attenuated retroviral or other viral
vectors (see U.S. Patent No. 4,980,286), or by direct injection of naked DNA, or by use of
microparticle bombardment (e.g., a gene gun; Biolistic, Dupont), or coating with lipids or
cell-surface receptors or transfecting agents, encapsulation in liposomes, microparticles,
or microcapsules, or by administering them in linkage to a peptide which is known to
enter the nucleus, by administering them in linkage to a ligand subject to receptor-mediated
endocytosis (see, e.g., Wu and Wu, 1987, J. Biol. Chem. 262:4429-4432) (which can be
used to target cell types specifically expressing the receptors), etc. In another
embodiment, nucleic acid-ligand complexes can be formed in which the ligand comprises
a fusogenic viral peptide to disrupt endosomes, allowing the nucleic acid to avoid
lysosomal degradation. In yet another embodiment, the nucleic acid can be targeted in
vivo for cell specific uptake and expression, by targeting a specific receptor (see, e.g.,
PCT Publications WO 92/06180; WO 92/22635; WO 92/20316; WO 93/14188; WO
93/20221). Alternatively, the nucleic acid can be introduced intracellularly and
incorporated within host cell DNA for expression, by homologous recombination (Koller
and Smithies, 1989, Proc. Natl. Acad. Sci. USA 86:8932-8935; and Zijlstra et al., 1989,
Nature 342:435-438).


In a specific embodiment, viral vectors that contain nucleic acid sequences
encoding an antibody of the invention or fragments thereof are used. For example, a
retroviral vector can be used (see Miller et al., 1993, Meth. Enzymol. 217:581-599).
These retroviral vectors contain the components necessary for the correct packaging of
the viral genome and integration into the host cell DNA. The nucleic acid sequences
encoding the antibody to be used in gene therapy are cloned into one or more vectors,
which facilitates delivery of the gene into a subject. More detail about retroviral vectors
can be found in Boesen et al., 1994, Biotherapy 6:291-302, which describes the use of a
retroviral vector to deliver the mdr1 gene to hematopoietic stem cells in order to make
the stem cells more resistant to chemotherapy. Other references illustrating the use of
retroviral vectors in gene therapy are Clowes et al., 1994, J. Clin. Invest. 93:644-651;
Klein et al., 1994, Blood 83:1467-1473; Salmons and Gunzberg, 1993, Human Gene
Therapy 4:129-141; and Grossman and Wilson, 1993, Curr. Opin. in Genetics and Devel.
3:110-114.

Adenoviruses are other viral vectors that can be used in gene therapy and can be
targeted to the central nervous system. Adenoviruses have the advantage of being
capable of infecting non-dividing cells. Kozarsky and Wilson, 1993, Current Opinion in
Genetics and Development 3:499-503 present a review of adenovirus-based gene therapy.
Other instances of the use of adenoviruses in gene therapy can be found in Rosenfeld et
al., 1991, Science 252:431-434; Rosenfeld et al., 1992, Cell 68:143-155; Mastrangeli et
al., 1993, J. Clin. Invest. 91:225-234; PCT Publication WO 94/12649; and Wang et al.,
1995, Gene Therapy 2:775-783. Adeno-associated virus (AAV) has also been proposed
for use in gene therapy (Walsh et al., 1993, Proc. Soc. Exp. Biol. Med. 204:289-300; and
U.S. Patent No. 5,436,146).

Another approach to gene therapy involves transferring a gene to cells in tissue
culture by such methods as electroporation, lipofection, calcium phosphate mediated
transfection, or viral infection. Usually, the method of transfer includes the transfer of a
selectable marker to the cells. The cells are then placed under selection to isolate those
cells that have taken up and are expressing the transferred gene. Those cells are then
delivered to a subject.

In this embodiment, the nucleic acid is introduced into a cell prior to
administration in vivo of the resulting recombinant cell. Such introduction can be carried
out by any method known in the art, including but not limited to transfection,


electroporation, microinjection, infection with a viral or bacteriophage vector containing
the nucleic acid sequences, cell fusion, chromosome-mediated gene transfer,
microcell-mediated gene transfer, spheroplast fusion, etc. Numerous techniques are
known in the art for the introduction of foreign genes into cells (see, e.g., Loeffler and
Behr, 1993, Meth. Enzymol. 217:599-618; and Cohen et al., 1993, Meth. Enzymol.
217:618-644) and may be used in accordance with the present invention, provided that the
necessary developmental and physiological functions of the recipient cells are not
disrupted. The technique should provide for the stable transfer of the nucleic acid to the
cell, so that the nucleic acid is expressible by the cell and preferably heritable and
expressible by its cell progeny.

The resulting recombinant cells can be delivered to a subject by various methods
known in the art. Recombinant blood cells (e.g., hematopoietic stem or progenitor cells)
are preferably administered intravenously. The amount of cells envisioned for use
depends on the desired effect, patient state, etc., and can be determined by one skilled in
the art.

Cells into which a nucleic acid can be introduced for purposes of gene therapy
encompass any desired, available cell type, and include but are not limited to epithelial
cells, endothelial cells, keratinocytes, fibroblasts, muscle cells, hepatocytes; blood cells
such as T lymphocytes, B lymphocytes, monocytes, macrophages, neutrophils,
eosinophils, megakaryocytes, granulocytes; various stem or progenitor cells, in particular
hematopoietic stem or progenitor cells, e.g., as obtained from bone marrow, umbilical
cord blood, peripheral blood, fetal liver, etc. In a preferred embodiment, the cell is a
neural cell. In a preferred embodiment, the cell used for gene therapy is autologous to the
subject.

Some embodiments of the present invention provide a recombinant cell containing
an isolated nucleic acid comprising SEQ ID NO: 2, SEQ ID NO: 3, or the complement
thereto. These recombinant cells may be used for gene therapy or other purposes. Some
embodiments of the present invention provide a recombinant cell that contains SEQ ID
NO: 2, SEQ ID NO: 3, or the complement thereto, in which the nucleotide sequence
encoding SEQ ID NO: 2, SEQ ID NO: 3, or the complement thereto, is not native to the
cell. Some embodiments of the present invention provide a recombinant cell containing
an isolated nucleic acid comprising SEQ ID NO: 2, a coding region of SEQ ID NO: 2,
SEQ ID NO: 3, a coding region of SEQ ID NO: 3, or the complement of any of the


foregoing, in which the nucleotide sequence is under the control of a promoter 
heterologous to the nucleotide sequence. Some embodiments of the present invention 
provide a recombinant cell containing a nucleic acid vector that comprises a nucleic acid 
comprising SEQ ID NO: 2, a coding region of SEQ ID NO: 2, SEQ ID NO: 3, a coding 
region of SEQ ID NO: 3, or the complement of any of the foregoing.

One aspect of the present invention provides a method of producing protein. The
method uses a recombinant cell containing any one of the following nucleic acids:

(i) an isolated nucleic acid comprising the nucleotide sequence of SEQ ID NO: 2,
a coding region of SEQ ID NO: 2, SEQ ID NO: 3, a coding region of SEQ ID NO: 3, or
the complement of any of the foregoing;

(ii) the isolated nucleic acid of (i) that is a DNA;

(iii) an isolated nucleic acid comprising a nucleotide sequence, or the complement
thereof, encoding any of the following proteins:

a) a purified protein comprising the amino acid sequence of SEQ ID NO: 8;

b) a purified protein encoded by a nucleic acid hybridizable to a DNA having a
sequence consisting of the coding region of SEQ ID NO: 2;

c) a purified protein comprising an amino acid sequence that has at least 60%
identity to the amino acid sequence set forth in SEQ ID NO: 8, in which percentage
identity is determined over an amino acid sequence of identical size as SEQ ID NO: 8;

d) a purified protein comprising an amino acid sequence that has at least 90%
identity to the amino acid sequence set forth in SEQ ID NO: 8, in which percentage
identity is determined over an amino acid sequence of identical size as SEQ ID NO: 8.

In this aspect of the present invention, the nucleic acid sequence is under the
control of a promoter heterologous to the nucleotide sequence. The cell is grown such
that the protein encoded by the nucleic acid is expressed by the cell. Then the expressed
protein is recovered. One embodiment of the present invention provides an isolated
protein that is the product of this process.

One aspect of the present invention provides a method of producing SEQ ID NO:
8. The method comprises (i) growing a recombinant cell containing SEQ ID NO: 2,
SEQ ID NO: 3, or the complement thereto, in which the nucleic acid sequence encoding
SEQ ID NO: 2, SEQ ID NO: 3, or the complement thereto is under a promoter that is not


native to SEQ ID NO: 2, SEQ ID NO: 3, such that encoded SEQ ID NO: 8 is expressed in 
the cell; and (ii) recovering the expressed SEQ ID NO: 8. Some embodiments of the 
present invention provide the product of a process in accordance with this aspect of the 
invention. 

In an embodiment in which recombinant cells are used in gene therapy, nucleic
acid sequences encoding a modulator are introduced into the cells such that they are
expressible by the cells or their progeny, and the recombinant cells are then administered
in vivo for therapeutic effect. In a specific embodiment, stem or progenitor cells are used.
Any stem and/or progenitor cells that can be isolated and maintained in vitro can
potentially be used in accordance with this embodiment of the present invention (see, e.g.,
PCT Publication WO 94/08598; Stemple and Anderson, 1992, Cell 71:973-985;
Rheinwald, 1980, Meth. Cell Bio. 21A:229; and Pittelkow and Scott, 1986, Mayo Clinic
Proc. 61:771).

In a specific embodiment, the nucleic acid to be introduced for purposes of gene
therapy comprises an inducible promoter operably linked to the coding region, such that
expression of the nucleic acid is controllable by controlling the presence or absence of the
appropriate inducer of transcription.

5.18.13. DEMONSTRATION OF THERAPEUTIC UTILITY
The modulators of the invention can be assayed by any method well known in the
art. The modulators of the invention or fragments thereof are preferably tested in vitro,
and then in vivo for the desired therapeutic or prophylactic activity, prior to use in
humans. For example, in vitro assays that can be used to determine whether
administration of a specific composition of the present invention is indicated, include in
vitro cell culture assays in which a subject tissue sample is grown in culture, and exposed
to or otherwise administered a composition of the present invention, and the effect of
such a composition of the present invention upon the tissue sample is observed. The
following subsections describe various assays that can be used to determine the efficacy
of the modulators of the invention.




5.18.13.1. SINGLE DOSE EFFECTS ON FOOD AND WATER INTAKE AND BODY WEIGHT GAIN IN FASTED RATS

Subjects. Male Sprague-Dawley rats (Sasco, St. Louis, Mo.) weighing 210-300 g
at the beginning of the experiment are used. Animals are triple-housed in stainless steel
hanging cages in a temperature (22°C) and humidity (40-70% RH) controlled animal
facility with a 12:12 hour light-dark cycle. Food (Standard Rat Chow, PMI Feeds Inc.,
#5012) and water are available ad libitum.

Apparatus. Consumption data is collected while the animals are housed in
Nalgene Metabolic cages (Model #650-0100). Each cage comprises subassemblies made
of clear polymethylpentene (PMP), polycarbonate (PC), or stainless steel (SS). The entire
cylinder-shaped plastic and SS cage rests on a SS stand and houses one animal. The
animal is contained in the round Upper Chamber (PC) assembly (12 cm high and 20 cm
in diameter) and rests on a SS floor. Two subassemblies are attached to the Upper
Chamber. The first assembly consists of a SS feeding chamber (10 cm long, 5 cm high
and 5 cm wide) with a PC feeding drawer attached to the bottom. The feeding drawer has
two compartments: a food storage compartment with the capacity for approximately 50 g
of pulverized rat chow, and a food spillage compartment. The animal is allowed access to
the pulverized chow by an opening in the SS floor of the feeding chamber. The floor of
the feeding chamber does not allow access to the food dropped into the spillage
compartment.

The second assembly includes a water bottle support, a PC water bottle (100 ml
capacity) and a graduated water spillage collection tube. The water bottle support funnels
any spilled water into the water spillage collection tube. The lower chamber consists of a
PMP separating cone, PMP collection funnel, PMP fluid (urine) collection tube, and a
PMP solid (feces) collection tube. The separating cone is attached to the top of the
collection funnel, which in turn is attached to the bottom of the Upper Chamber. The
urine runs off the separating cone onto the walls of the collection funnel and into the urine
collection tube. The separating cone also separates the feces and funnels it into the feces
collection tube.

Food consumption, water consumption, and body weight are measured with an
Ohaus Portable Advanced scale (± 0.1 gram accuracy).

Procedure. Prior to the day of testing, animals are habituated to the testing
apparatus by placing each animal in a Metabolic cage for one hour. On the day of the




experiment, animals that are food deprived the previous night are weighed and assigned
to treatment groups. Assignments are made using a quasi-random method utilizing the
body weights to assure that the treatment groups have similar average body weight.
Animals are then administered either vehicle (generally 0.5% methyl cellulose, MC) or
test compound. At that time, the feeding drawer is filled with pulverized chow, and the
filled water bottle, the empty urine and feces collection tubes are weighed. Two hours
after test compound treatment, each animal is weighed and placed in a Metabolic Cage.
Following a one hour test session, animals are removed and body weight obtained. The
food and water containers are then weighed and the data recorded.
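
The text does not specify the quasi-random assignment method. One possible reading, offered only as an assumed sketch, is a blocked randomization: animals are ranked by body weight, taken in blocks of one animal per treatment group, and each block is shuffled across the groups. The function name, animal identifiers, and body weights below are hypothetical.

```python
import random
from statistics import mean


def assign_groups(body_weights, n_groups, seed=None):
    """Assign animals to treatment groups so that mean body weights stay similar.

    Assumed method (not taken from the source): rank animals by weight, then
    randomly distribute each consecutive block of n_groups animals, one per group.
    """
    rng = random.Random(seed)
    ranked = sorted(body_weights, key=body_weights.get, reverse=True)
    groups = [[] for _ in range(n_groups)]
    for start in range(0, len(ranked), n_groups):
        block = ranked[start:start + n_groups]
        slots = list(range(n_groups))
        rng.shuffle(slots)  # random order of groups within this weight block
        for animal, slot in zip(block, slots):
            groups[slot].append(animal)
    return groups


# Hypothetical fasted body weights (g) for twelve animals
weights = {f"rat{i:02d}": 210 + 8 * i for i in range(12)}
groups = assign_groups(weights, n_groups=3, seed=1)
for members in groups:
    print(sorted(members), round(mean(weights[a] for a in members), 1))
```

Because every group receives exactly one animal from each weight block, the group means cannot differ by more than the weight spread within a block, while the assignment within each block remains random.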

Test Compound. Test compound is administered orally (0.1-50 mg/kg for oral
(PO) dosing) using a gavage tube connected to a 3 or 5 ml syringe at a volume of 10
ml/kg. In some instances test compound is administered by a systemic route (e.g., by
intravenous injection; 0.1-20 mg/kg for i.v. dosing). Test compound for oral dosing is
made into a homogenous suspension by stirring and ultrasonicating for at least one hour
prior to dosing.
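
The dosing arithmetic implied by a fixed volume of 10 ml/kg can be sketched as follows. The 250 g body weight and the helper names are illustrative assumptions; only the 10 ml/kg volume and the 0.1-50 mg/kg oral dose range come from the text.

```python
def gavage_volume_ml(body_weight_g, volume_ml_per_kg=10.0):
    """Volume to administer for one animal at a fixed ml/kg dosing volume."""
    return volume_ml_per_kg * body_weight_g / 1000.0


def suspension_conc_mg_per_ml(dose_mg_per_kg, volume_ml_per_kg=10.0):
    """Suspension concentration that delivers the dose at the fixed volume."""
    return dose_mg_per_kg / volume_ml_per_kg


# Hypothetical 250 g animal dosed at the top of the oral range (50 mg/kg).
volume = gavage_volume_ml(250)            # ml to draw into the syringe
conc = suspension_conc_mg_per_ml(50)      # mg of compound per ml of vehicle
amount = volume * conc                    # mg of compound delivered
```

Under these assumptions the animal receives 2.5 ml of a 5 mg/ml suspension, which fits within the 3 ml syringe mentioned above.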

Statistical Analyses. The means and standard errors of the mean (SEM) for food
consumption, water consumption, and body weight change are calculated. One-way
analysis of variance using Systat (5.2.1) is used to test for group differences. A significant
effect is defined as having a p value of <0.05.
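
As a minimal sketch of these calculations outside Systat, the mean, SEM, and one-way ANOVA F statistic can be computed with the Python standard library alone. The food-intake values and function names below are hypothetical; the p value would then be read from the F distribution with the returned degrees of freedom using a statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def sem(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return stdev(values) / sqrt(len(values))


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical one-hour food consumption (g) for three treatment groups.
vehicle = [5.1, 4.8, 5.4, 5.0]
low_dose = [4.2, 4.6, 4.0, 4.4]
high_dose = [2.9, 3.3, 3.1, 2.7]

summary = {name: (mean(g), sem(g)) for name, g in
           [("vehicle", vehicle), ("low", low_dose), ("high", high_dose)]}
f_stat, df_b, df_w = one_way_anova_f([vehicle, low_dose, high_dose])
```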

The following parameters are defined: Body weight change is the difference
between the body weight of the animal immediately prior to placement in the metabolic
cage and its body weight at the end of the one hour test session. Food consumption is the
difference in the weight of the food drawer prior to testing and the weight following the
one hour test session. Water consumption is the difference in the weight of the water
bottle prior to testing and the weight following the one hour test session.
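
The three parameters defined above reduce to simple differences of paired weighings. The sketch below assumes a sign convention that the text does not state: body weight change is taken as post minus pre, and consumption as pre minus post, so that intake is reported as a positive number. The dictionary keys and the example weights are hypothetical.

```python
def one_hour_parameters(pre, post):
    """Derive the test-session parameters from pre- and post-session weights.

    pre, post: dicts with keys 'body', 'food_drawer', 'water_bottle' (grams).
    """
    return {
        # Assumed sign: positive means the animal gained weight.
        "body_weight_change": post["body"] - pre["body"],
        # Assumed sign: positive means food or water was removed from the container.
        "food_consumption": pre["food_drawer"] - post["food_drawer"],
        "water_consumption": pre["water_bottle"] - post["water_bottle"],
    }


# Hypothetical weighings for a single animal.
pre = {"body": 250.0, "food_drawer": 80.0, "water_bottle": 150.0}
post = {"body": 252.5, "food_drawer": 76.5, "water_bottle": 146.0}
params = one_hour_parameters(pre, post)
```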

5.18.13.2. OVERNIGHT FOOD INTAKE 
Subjects. Male Sprague-Dawley rats (Sasco, St. Louis, Mo.) weighing 210-300 g
at the beginning of the experiment are used. Animals are pair- or triple-housed in stainless
steel hanging cages in a temperature (22°C) and humidity (40-70% RH) controlled animal
facility with a 12:12 hour light-dark cycle. Food (Standard Rat Chow, PMI Feeds Inc.,
#5012) and water are available ad libitum.


Apparatus. Consumption and elimination data are obtained while the animals are
housed in Nalgene Metabolic cages (Model #650-0100). Each cage is comprised of
subassemblies made of clear polymethylpentene (PMP), polycarbonate (PC), or stainless
steel (SS). All parts disassemble for quick and accurate data collection and for cleaning.
The entire cylinder-shaped plastic and SS cage rests on a SS stand and houses one animal.

The animal is contained in the round Upper Chamber (PC) assembly (12 cm high
and 20 cm in diameter) and rests on a SS floor. Two subassemblies are attached to the
Upper Chamber. The first assembly consists of a SS feeding chamber (10 cm long, 5 cm
high and 5 cm wide) with a PC feeding drawer attached to the bottom. The feeding
drawer has two compartments: a food storage compartment with the capacity for
approximately 50 grams of pulverized rat chow, and a food spillage compartment. The
animal is allowed access to the pulverized chow by an opening in the SS floor of the
feeding chamber. The floor of the feeding chamber does not allow access to the food
dropped into the spillage compartment. The second assembly includes a water bottle
support, a PC water bottle (100 ml capacity) and a graduated water spillage collection
tube. The water bottle support funnels any spilled water into the water spillage collection
tube.

The lower chamber consists of a PMP separating cone, PMP collection funnel,
PMP fluid (urine) collection tube, and a PMP solid (feces) collection tube. The
separating cone is attached to the top of the collection funnel, which in turn is attached to
the bottom of the Upper Chamber. The urine runs off the separating cone onto the walls
of the collection funnel and into the urine collection tube. The separating cone also
separates the feces and funnels it into the feces collection tube.

Food consumption, water consumption, urine excretion, feces excretion, and body
weight are measured with an Ohaus Portable Advanced scale (± 0.1 gram accuracy).

Procedure. On the day of the experiment, animals are weighed and assigned to
treatment groups. Assignments are made using a quasi-random method utilizing the body
weights to assure that the treatment groups have similar average body weight. Two hours
prior to lights off (1830 hours), animals are administered either vehicle (0.5% methyl
cellulose, MC) or test compound. At that time, the feeding drawer filled with pulverized
chow, the filled water bottle, and the empty urine and feces collection tubes are weighed.
Following dosing, each animal is weighed and placed in the Metabolic Cage. Animals
are removed from the Metabolic Chamber the following morning (0800 hours) and body
weight obtained. The food and water containers, and the feces and urine collection tubes,
are weighed and the data recorded.

Test Compound. Test compound is administered orally (PO) using a gavage tube
connected to a 3 or 5 ml syringe at a volume of 10 ml/kg. Test compound is made into a
homogenous suspension by stirring and ultrasonicating for at least one hour prior to
dosing. In some experiments, animals are tested for more than one night. In these
studies, animals are administered, on subsequent nights, the same treatment (test
compound or 0.5% MC) they had received the first night.

Statistical Analyses. The means and standard errors of the mean (SEM) for food
consumption, water consumption, urine excretion, feces excretion, and body weight
change are calculated. One-way analysis of variance using Systat (5.2.1) is used to test for
group differences. A significant effect is defined as having a p value of <0.05.

The following parameters are defined: Body weight change is the difference
between the body weight of the animal immediately prior to placement in the metabolic
cage (1630 hours) and its body weight the following morning (0800 hours). Food
consumption is the difference in the weight of the food drawer at 1630 and the weight at
0800. Water consumption is the difference in the weight of the water bottle at 1630 and
the weight at 0800. Fecal excretion is the difference in the weight of the empty fecal
collection tube at 1630 and the weight at 0800. Urinary excretion is the difference in the
weight of the empty urine collection tube at 1630 and the weight at 0800.

5.18.14. METHODS FOR DETECTING CHANGES IN GENE EXPRESSION OR PROTEIN EXPRESSION

This invention provides several methods for detecting changes in gene expression
or protein expression, including but not limited to the expression of SEQ ID NO: 1, SEQ
ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 9, SEQ ID NO: 12, SEQ ID NO: 13, SEQ ID NO:
16, SEQ ID NO: 19, SEQ ID NO: 20, homologs of each of the foregoing, and marker
genes operably linked to each of the foregoing. SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID
NO: 3, SEQ ID NO: 9, SEQ ID NO: 12, SEQ ID NO: 13, SEQ ID NO: 16, SEQ ID NO:
19, SEQ ID NO: 20 are described in Section 6.7.5, below. Assays for changes in gene
expression are well known in the art (see, e.g., PCT Publication No. WO 96/34099,
published October 31, 1996, which is incorporated by reference herein in its entirety).
Such assays may be performed in vitro using transformed cell lines, immortalized cell
lines, or recombinant cell lines.

The RNA expression or protein expression of an open reading frame (which may
be of a marker gene or may be of a gene described in Section 6.7.5), regulated by a
promoter native to the gene described in Section 6.7.5 may be measured by measuring the
amount or abundance of the RNA (as RNA or cDNA) or protein. In particular, the assays
may detect the presence of increased or decreased expression of a gene described in
Section 6.7.5 (e.g., SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 9, SEQ
ID NO: 12, SEQ ID NO: 13, SEQ ID NO: 16, SEQ ID NO: 19, SEQ ID NO: 20) on the
basis of increased or decreased mRNA expression (using, e.g., nucleic acid probes),
increased or decreased levels of protein products (using, e.g., antibodies thereto), or
increased or decreased levels of expression of a marker gene (e.g., green fluorescent
protein "GFP") operably linked to the 5' promoter region in a recombinant construct.

The present invention envisions monitoring changes in gene expression (e.g., a
gene disclosed in Section 6.7.5, below) or marker gene expression by any expression
analysis technique known to one of skill in the art, including but not limited to,
differential display, serial analysis of gene expression (SAGE), nucleic acid array
technology, oligonucleotide array technology, GeneChip expression analysis, dot blot
hybridization, northern blot hybridization, subtractive hybridization, protein chip arrays,
Western blot, immunoprecipitation followed by SDS PAGE, immunocytochemistry,
proteome analysis and mass-spectrometry of two-dimensional protein gels.

Methods of gene expression profiling to measure changes in gene expression are
well-known in the art, as exemplified by the following references describing subtractive
hybridization (Wang and Brown, 1991, Proc. Natl. Acad. Sci. U.S.A. 88:11505-11509),
differential display (Liang and Pardee, 1992, Science 257:967-971), SAGE (Velculescu et
al., 1995, Science 270:484-487), proteome analysis (Humphery-Smith et al., 1997,
Electrophoresis 18:1217-1242; Dainese et al., 1997, Electrophoresis 18:432-442), and
hybridization-based methods employing nucleic acid arrays (Heller et al., 1997, Proc.
Natl. Acad. Sci. U.S.A. 94:2150-2155; Lashkari et al., 1997, Proc. Natl. Acad. Sci. U.S.A.
94:13057-13062; Wodicka et al., 1997, Nature Biotechnol. 15:1259-1267). Microarray
technology is described in more detail below.

In one series of embodiments, various expression analysis techniques may be used
to identify molecules that affect expression of a gene disclosed in Section 6.7.5 or marker
gene expression, by comparing a cell line expressing a gene disclosed in Section 6.7.5
(e.g., SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 9, SEQ ID NO: 12,
SEQ ID NO: 13, SEQ ID NO: 16, SEQ ID NO: 19, SEQ ID NO: 20) or a marker gene
under the control of a gene promoter sequence in the absence of a test molecule to a cell
line expressing the same gene or marker gene under the control of the same promoter
sequence in the presence of the test molecule. In a preferred embodiment, expression
analysis techniques are used to identify a molecule that upregulates a gene disclosed in
Section 6.7.5 or upregulates marker gene expression upon treatment of a cell with the
molecule.

5.18.15. METHODS FOR MONITORING REPORTER GENE EXPRESSION OF A GENE OF THE PRESENT INVENTION

5.18.15.1. HETEROLOGOUS REPORTER GENE CONSTRUCT
In a preferred embodiment, the cell being assayed for reporter gene expression
contains a fusion construct of at least one transcriptional promoter region for a gene
disclosed in Section 6.7.5 (e.g., SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID
NO: 9, SEQ ID NO: 12, SEQ ID NO: 13, SEQ ID NO: 16, SEQ ID NO: 19, SEQ ID NO:
20) (also referred to herein as the test gene), or homologs of the foregoing, each operably
linked to a marker gene expressing a detectable and/or selectable product. Increased
expression of a marker gene operably linked to a gene promoter indicates increased
expression of the test gene.

The marker gene is a sequence encoding a detectable or selectable marker, the
expression of which is regulated by at least one gene promoter region in the heterologous
construct used in the present invention. Preferably, the assay is carried out in the absence
of background levels of marker gene expression (e.g., in a cell that is mutant or otherwise
lacking in the marker gene). If not already lacking in endogenous marker gene activity,
cells mutant in the marker gene may be selected by known methods, or the cells can be
made mutant in the marker gene by known gene-disruption methods prior to introducing
the marker gene (Rothstein, 1983, Meth. Enzymol. 101:202-211).

A marker gene of the invention may be any gene that encodes a detectable and/or
selectable product. The detectable marker can be any molecule that can give rise to a
detectable signal, e.g., a fluorescent protein or a protein that can be readily visualized or
that is recognizable by a specific antibody or that gives rise enzymatically to a signal.
The selectable marker can be any molecule that can be selected for its expression, e.g.,
which gives cells a selective advantage over cells not having the selectable marker under
appropriate (selective) conditions. In preferred aspects, the selectable marker is an
essential nutrient in which the cell in which the interaction assay occurs is mutant or
otherwise lacks or is deficient, and the selection medium lacks such nutrient. In one
embodiment, one type of marker gene is used to detect gene expression. In another
embodiment, more than one type of marker gene is used to detect gene expression.

Preferred marker genes include but are not limited to, green fluorescent protein
(GFP) (Cubitt et al., 1995, Trends Biochem. Sci. 20:448-455), red fluorescent protein,
blue fluorescent protein, luciferase, LEU2, LYS2, ADE2, TRP1, CAN1, CYH2, GUS,
CUP1 or chloramphenicol acetyl transferase (CAT). Other marker genes include, but are
not limited to, URA3, HIS3 and/or the lacZ genes (see, e.g., Rose and Botstein, 1983,
Meth. Enzymol. 101:167-180) operably linked to GAL4 DNA-binding domain
recognition elements. Alam and Cook disclose non-limiting examples of detectable
marker genes that can be operably linked to a glucan synthase pathway reporter gene
promoter region (Alam and Cook, 1990, Anal. Biochem. 188:245-254).

In a preferred embodiment, more than one different marker gene is used to detect
transcriptional activation, e.g., one encoding a detectable marker, and one or more
encoding one or more different selectable marker(s), or e.g., different detectable markers.
Expression of the marker genes can be detected and/or selected for by techniques known
in the art (see, e.g., U.S. Patent Nos. 6,057,101 and 6,083,693).

Methods to construct a suitable reporter construct are disclosed herein by way of
illustration and not limitation and any other methods known in the art may also be used.
In a preferred embodiment, the reporter gene construct is a chimeric reporter construct
comprising a marker gene that is transcribed under the control of a gene promoter
sequence comprising all or a portion of a promoter region of SEQ ID NO: 1, SEQ ID NO:
2, SEQ ID NO: 3, SEQ ID NO: 9, SEQ ID NO: 12, SEQ ID NO: 13, SEQ ID NO: 16,
SEQ ID NO: 19, SEQ ID NO: 20. If not already a part of the DNA sequence, the
translation initiation codon, ATG, is provided in the correct reading frame upstream of
the DNA sequence.

Vectors comprising all or portions of the gene sequences of SEQ ID NO: 1, SEQ
ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 9, SEQ ID NO: 12, SEQ ID NO: 13, SEQ ID NO:
16, SEQ ID NO: 19, SEQ ID NO: 20 useful in the construction of recombinant reporter
gene constructs and cells are provided. The vectors of this invention also include those
vectors comprising DNA sequences that hybridize under stringent conditions to SEQ ID
NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 9, SEQ ID NO: 12, SEQ ID NO: 13,
SEQ ID NO: 16, SEQ ID NO: 19, SEQ ID NO: 20 gene sequences, and conservatively
modified variations thereof.

The vectors of this invention may be present in transformed or transfected cells, 
cell lysates, or in partially purified or substantially pure forms. DNA vectors may contain 
a means for amplifying the copy number of the gene of interest, stabilizing sequences, or 
alternatively may be designed to favor directed or non-directed integration into the host 
cell genome.

Given the strategies described herein, one of skill in the art can construct a variety 
of vectors and nucleic acid molecules comprising functionally equivalent nucleic acids. 
DNA cloning and sequencing methods are well known to those of skill in the art and are 
described in an assortment of laboratory manuals, including Sambrook et al., 1989, supra,
and Ausubel et al., 2002 Supplement.

Transformation and other methods of introducing nucleic acids into a host cell
(e.g., transfection, electroporation, liposome delivery, membrane fusion techniques, high
velocity DNA-coated pellets, viral infection and protoplast fusion) can be accomplished
by a variety of methods that are well known in the art (see, for instance, Ausubel, supra,
and Sambrook, supra). S. cerevisiae cells of the invention can be transformed or
transfected with an expression vector, such as a plasmid, a cosmid, or the like, wherein
the expression vector comprises the DNA of interest. Alternatively, the cells may be
infected by a viral expression vector comprising the DNA or RNA of interest.

Particular details of the transfection and expression of nucleic acid sequences are
well documented and are understood by those of skill in the art. Further details on the
various technical aspects of each of the steps used in recombinant production of foreign
genes in expression systems can be found in a number of texts and laboratory manuals in
the art (see, e.g., Ausubel et al., 2002, herein incorporated by reference).



5.18.15.2. OTHER METHODS FOR MONITORING REPORTER GENE EXPRESSION

In accordance with the present invention, reporter gene expression can be
monitored at the RNA or the protein level. In a specific embodiment, molecules that
affect reporter gene expression may be identified by detecting differences in the level of
marker protein expressed by cells contacted with a test molecule versus the level of
marker protein expressed by cells in the absence of the test molecule.

Protein expression can be monitored using a variety of methods that are well
known to those of skill in the art. For example, protein chips or protein microarrays (e.g.,
ProteinChip™, Ciphergen Biosystem) and two-dimensional electrophoresis (see e.g., U.S.
Patent No. 6,064,754 which is incorporated herein by reference in its entirety) can be
utilized to monitor protein expression levels. As used herein, "two-dimensional
electrophoresis" (2D-electrophoresis) means a technique comprising isoelectric focusing,
followed by denaturing electrophoresis, generating a two-dimensional gel (2D-gel)
containing a plurality of proteins. Any protocol for 2D-electrophoresis known to one of
ordinary skill in the art can be used to analyze protein expression by the reporter genes of
the invention. For example, 2D electrophoresis can be performed according to the
methods described in O'Farrell, 1975, J. Biol. Chem. 250: 4007-4021.

Liquid High Throughput-Like Assay. In a preferred embodiment, a liquid high
throughput-like assay is used to determine the protein expression level of a reporter gene.
The following exemplary, but not limiting, assay may be used:

A reporter construct is transformed into a cell strain. Cultures from solid media
plates are used to inoculate liquid cultures in Casamino Acids media or an equivalent
media. This liquid culture is grown and then diluted in Casamino Acids media or an
equivalent media.

A test molecule is selected for the assay, preferably but not necessarily along with
a negative control molecule. The test molecule and negative control molecule are
separately added to an assay plate containing multiple wells and serially diluted (e.g., 1 to
2) into Casamino Acids media plus DMSO in sequential columns, so that each plate
contains a range of concentrations of each drug. If a negative control is being used, one
column of each plate may be used as a "no drug" control, containing only Casamino
Acids media plus DMSO. The skilled artisan will note that different assay plates may be
used, such as those with 96, 384 or 1536 well format.
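
The serial dilution across sequential columns can be sketched as below. The sketch reads "1 to 2" as a two-fold dilution per column and reserves the last column as the "no drug" control; both readings, along with the 12-column layout of a 96-well plate, the starting concentration, and the function name, are assumptions made for illustration.

```python
def dilution_series(top_conc, n_columns, factor=2.0, no_drug_column=True):
    """Concentration in each plate column for a serial dilution.

    When no_drug_column is True, the last column is left at 0.0 as the
    'no drug' control (media plus DMSO only).
    """
    n_drug = n_columns - 1 if no_drug_column else n_columns
    concs = [top_conc / factor ** i for i in range(n_drug)]
    if no_drug_column:
        concs.append(0.0)
    return concs


# Hypothetical layout: 12 columns, top concentration of 100 (arbitrary units).
plate_columns = dilution_series(top_conc=100.0, n_columns=12)
```

With these settings the first eleven columns run from 100 down to 100/1024 in two-fold steps, and the twelfth column holds the control.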

An aliquot of liquid reporter strain is added to each well of the serial dilution
plates from above and mixed. The assay plates are then incubated. After incubation the
assay plates are analyzed for detectable marker gene product. In a preferred embodiment,
the assay plates are imaged in a Molecular Dynamics Fluorimager SI to measure the
fluorescence from the GFP reporters.

The results are then analyzed, as described above. If the drug is an inhibitor of the
gene product (e.g., an inhibitor of SEQ ID NO: 4, SEQ ID NO: 5, SEQ ID NO: 6, SEQ
ID NO: 7, SEQ ID NO: 8, SEQ ID NO: 10, SEQ ID NO: 11, SEQ ID NO: 14, SEQ ID
NO: 15, SEQ ID NO: 17, SEQ ID NO: 18, SEQ ID NO: 21, SEQ ID NO: 22, SEQ ID
NO: 23, SEQ ID NO: 24, SEQ ID NO: 25, SEQ ID NO: 26, SEQ ID NO: 27, SEQ ID
NO: 28, and SEQ ID NO: 29), the reporter will show increases in fluorescence for the
higher drug concentrations versus the lower drug concentrations and/or the no drug
controls.
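
One way to express this readout, offered only as an assumed sketch, is to normalize the fluorescence at each drug concentration to the mean of the no drug controls and then check that the signal does not fall as concentration rises and that the top concentration exceeds a chosen fold threshold. The two-fold threshold, the function names, and the fluorescence values are hypothetical and are not taken from the text.

```python
from statistics import mean


def fold_induction(signal_by_conc, control_signals):
    """Mean fluorescence at each concentration divided by the control mean."""
    baseline = mean(control_signals)
    return {c: mean(v) / baseline for c, v in sorted(signal_by_conc.items())}


def looks_like_inhibitor(fold, threshold=2.0):
    """True when the signal is non-decreasing with concentration and the
    highest concentration reaches the (hypothetical) fold threshold."""
    values = [fold[c] for c in sorted(fold)]
    non_decreasing = all(a <= b for a, b in zip(values, values[1:]))
    return non_decreasing and values[-1] >= threshold


# Hypothetical replicate fluorescence readings keyed by drug concentration.
signals = {1.0: [100, 110], 10.0: [150, 160], 100.0: [300, 320]}
controls = [100, 100]
fold = fold_induction(signals, controls)
is_hit = looks_like_inhibitor(fold)
```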



5.18.15.3. SPECIFIC EMBODIMENTS 
One embodiment of the present invention provides a method for determining
whether a candidate molecule affects a body weight disorder associated with an organism.
In step (a) of the method, a cell from the organism is contacted with the candidate
molecule. Alternatively, the candidate molecule is recombinantly expressed within the
cell. In step (b) of the method, a determination is made as to whether the RNA
expression or protein expression in the cell of at least one open reading frame is changed
in step (a) relative to the expression of the open reading frame in the absence of the
candidate molecule, where each open reading frame is regulated by a promoter native to a
nucleic acid sequence selected from the group consisting of SEQ ID NO: 1, SEQ ID NO:
2, SEQ ID NO: 3, SEQ ID NO: 9, SEQ ID NO: 12, SEQ ID NO: 13, SEQ ID NO: 16,
SEQ ID NO: 19, SEQ ID NO: 20, and homologs of each of the foregoing. The candidate
molecule affects a body weight disorder associated with the organism when the RNA
expression or protein expression of the at least one open reading frame is changed. The
candidate molecule does not affect a body weight disorder associated with the organism
when the RNA expression or protein expression of the at least one open reading frame is
unchanged. In some embodiments, the body weight disorder is obesity, anorexia nervosa,
bulimia nervosa or cachexia.

In some embodiments, the candidate molecule affects a body weight disorder
associated with the organism when a cell from the organism that is contacted with the
candidate molecule exhibits a lower expression level of a protein sequence in the group
consisting of SEQ ID NO: 4, SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, SEQ ID
NO: 8, SEQ ID NO: 10, SEQ ID NO: 11, SEQ ID NO: 14, SEQ ID NO: 15, SEQ ID NO:
17, SEQ ID NO: 18, SEQ ID NO: 21, SEQ ID NO: 22, SEQ ID NO: 23, SEQ ID NO: 24,
SEQ ID NO: 25, SEQ ID NO: 26, SEQ ID NO: 27, SEQ ID NO: 28, and SEQ ID NO: 29,
relative to a cell from the organism that is not contacted with the candidate molecule.

In some embodiments, step (b) comprises determining whether RNA expression is
changed. In some embodiments, step (b) comprises determining whether protein
expression is changed. In some embodiments, step (b) comprises determining whether
RNA or protein expression of at least two of the open reading frames is changed. In some
embodiments, step (a) comprises contacting the cell with the candidate molecule and step
(a) is carried out in a liquid high throughput-like assay.

In some embodiments, the cell comprises a promoter region of at least one gene
selected from the group consisting of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3,
SEQ ID NO: 9, SEQ ID NO: 12, SEQ ID NO: 13, SEQ ID NO: 16, SEQ ID NO: 19, SEQ
ID NO: 20, and homologs of each of the foregoing, each promoter region being operably
linked to a marker gene. Further, in such embodiments, step (b) comprises determining
whether the RNA expression or protein expression of the marker gene(s) is changed in
step (a) relative to the expression of the marker gene in the absence of the candidate
molecule. In some embodiments, the marker gene is selected from the group consisting
of green fluorescent protein, red fluorescent protein, blue fluorescent protein, luciferase,
LEU2, LYS2, ADE2, TRP1, CAN1, CYH2, GUS, CUP1 and chloramphenicol acetyl
transferase.

Another aspect of the invention provides a method of identifying a molecule that
specifically binds to a ligand selected from the group consisting of (i) a protein encoded
by a gene selected from the group consisting of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID
NO: 3, SEQ ID NO: 9, SEQ ID NO: 12, SEQ ID NO: 13, SEQ ID NO: 16, SEQ ID NO:
19, SEQ ID NO: 20, and homologs of each of the foregoing, and (ii) a biologically active
fragment of SEQ ID NO: 4, SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, SEQ ID NO:
8, SEQ ID NO: 10, SEQ ID NO: 11, SEQ ID NO: 14, SEQ ID NO: 15, SEQ ID NO: 17,
SEQ ID NO: 18, SEQ ID NO: 21, SEQ ID NO: 22, SEQ ID NO: 23, SEQ ID NO: 24,
SEQ ID NO: 25, SEQ ID NO: 26, SEQ ID NO: 27, SEQ ID NO: 28, and SEQ ID NO: 29.
The method comprises (a) contacting the ligand with one or more candidate molecules
under conditions conducive to binding between the ligand and the candidate molecules;
and (b) identifying a molecule within the one or more candidate molecules that binds to
the ligand.



5.18.16. METHOD OF TREATING OR PREVENTING BODY WEIGHT DISORDERS

One aspect of the invention provides a method of treating or preventing a body
weight disorder. The method comprises administering to a subject in which treatment is
desired a therapeutically effective amount of a molecule that inhibits a function of one or
more of the group consisting of SEQ ID NO: 8, SEQ ID NO: 11, SEQ ID NO: 15, SEQ
ID NO: 18, SEQ ID NO: 25, SEQ ID NO: 26, and SEQ ID NO: 27. In some
embodiments, the subject is human. In some embodiments, the molecule that inhibits a
function of one or more of the group consisting of SEQ ID NO: 8, SEQ ID NO: 11, SEQ
ID NO: 15, SEQ ID NO: 18, SEQ ID NO: 25, SEQ ID NO: 26, and SEQ ID NO: 27 is
selected from the group consisting of an antibody that binds to one of SEQ ID NO: 8,
SEQ ID NO: 11, SEQ ID NO: 15, SEQ ID NO: 18, SEQ ID NO: 25, SEQ ID NO: 26, and
SEQ ID NO: 27 or a fragment or derivative thereof containing the binding region
thereof, and a nucleic acid complementary to the RNA produced by transcription of a gene
encoding one of SEQ ID NO: 8, SEQ ID NO: 11, SEQ ID NO: 15, SEQ ID NO: 18, SEQ
ID NO: 25, SEQ ID NO: 26, and SEQ ID NO: 27. In some embodiments, the molecule
that inhibits a function of one or more of the group consisting of SEQ ID NO: 8, SEQ ID
NO: 11, SEQ ID NO: 15, SEQ ID NO: 18, SEQ ID NO: 25, SEQ ID NO: 26, and SEQ ID
NO: 27 is an oligonucleotide that (a) consists of at least six nucleotides; (b) comprises a
sequence complementary to at least a portion of an RNA transcript of a gene encoding
one of SEQ ID NO: 8, SEQ ID NO: 11, SEQ ID NO: 15, SEQ ID NO: 18, SEQ ID NO:
25, SEQ ID NO: 26 or SEQ ID NO: 27; and (c) is hybridizable to said RNA transcript
under moderately stringent conditions.

Another aspect of the invention provides a method of treating or preventing a
body weight disorder. The method comprises administering to a subject in which
treatment is desired a therapeutically effective amount of a molecule that enhances a
function of one or more of the group consisting of SEQ ID NO: 8, SEQ ID NO: 11, SEQ
ID NO: 15, SEQ ID NO: 18, SEQ ID NO: 25, SEQ ID NO: 26, and SEQ ID NO: 27. In
some embodiments, the subject is human.

Yet another aspect of the invention provides a method of diagnosing a disease or
disorder or the predisposition to said disease or disorder, wherein the disease or disorder
is characterized by an aberrant level of one of SEQ ID NO: 1 through SEQ ID NO: 29 in
a subject. The method comprises measuring the level of any one of SEQ ID NO: 1
through SEQ ID NO: 29 in a sample derived from the subject, in which an increase or
decrease in the level of one of SEQ ID NO: 1 through SEQ ID NO: 29 in the sample,
relative to the level of one of said SEQ ID NO: 1 through SEQ ID NO: 29 found in an
analogous sample not having the disease or disorder, indicates the presence of the disease
or disorder in the subject. In some embodiments, the disease or disorder is a body weight
disorder, such as obesity, anorexia nervosa, bulimia nervosa, or cachexia.

Still another aspect of the invention provides a method of diagnosing or screening
for the presence of or predisposition for developing a disease or disorder involving a body
weight disorder in a subject comprising detecting one or more mutations in at least one of
SEQ ID NO: 1 through SEQ ID NO: 29 in a sample derived from the subject, in which
the presence of the one or more mutations indicates the presence of the disease or
disorder or a predisposition for developing the disease or disorder.

5.18.17. TRANSGENIC ANIMALS 
The invention also provides animal models. Transgenic animals that have
incorporated and express a constitutively-functional obesity related gene have use as
animal models of diseases and disorders involving T-cell overactivation or
over-proliferation, or in which cell proliferation is desired. Such animals can be used to
screen for or test molecules for the ability to suppress activation and/or proliferation of
T-cells and thus treat or prevent such diseases and disorders. In one embodiment, animal
models for diseases and disorders involving obesity related disorders are provided. Such
animals can be initially produced by promoting homologous recombination between an
obesity related gene in its chromosome and an exogenous obesity related gene that has
been rendered biologically inactive. Preferably the sequence inserted is a heterologous
sequence, e.g., an antibiotic resistance gene. In a preferred aspect, this homologous
recombination is carried out by transforming embryo-derived stem (ES) cells with a
vector containing an insertionally inactivated gene, wherein the active gene encodes a
particular obesity related gene, such that homologous recombination occurs; the ES cells
are then injected into a blastocyst, and the blastocyst is implanted into a foster mother,
followed by the birth of the chimeric animal, also called a "knockout animal," in which
an obesity related gene has been inactivated (see Capecchi, 1989, Science 244:
1288-1292). The chimeric animal can be bred to produce additional knockout animals.
Chimeric animals can be and are preferably non-human mammals such as mice, hamsters,
sheep, pigs, cattle, etc. In a specific embodiment, a knockout mouse is produced.

Such knockout animals are expected to develop or be predisposed to developing
diseases or disorders involving obesity and thus can have use as animal models of such
diseases and disorders, e.g., to screen for or test molecules for the ability to promote
activation or proliferation and thus treat or prevent such diseases or disorders.

In a different embodiment of the invention, transgenic animals that have
incorporated and express a constitutively-functional obesity related gene have use as
animal models of diseases and disorders involving T-cell overactivation, or in which
T-cell activation is desired.

In particular, each transgenic line expressing a particular key gene under the
control of the regulatory sequences of a characterizing gene is created by the introduction,
for example by pronuclear injection, of a vector containing the transgene into a founder
animal, such that the transgene is transmitted to offspring in the line. The transgene
preferably randomly integrates into the genome of the founder but in specific
embodiments may be introduced by directed homologous recombination. In a preferred
embodiment, the transgene is present at a location on the chromosome other than the site
of the endogenous characterizing gene. In a preferred embodiment, homologous
recombination in bacteria is used for target-directed insertion of the key gene sequence
into the genomic DNA for all or a portion of the characterizing gene, including sufficient
characterizing gene regulatory sequences to promote expression of the characterizing
gene in its endogenous expression pattern. In a preferred embodiment, the characterizing
gene sequences are on a bacterial artificial chromosome (BAC). In specific
embodiments, the key gene coding sequences are inserted as a 5' fusion with the
characterizing gene coding sequence such that the key gene coding sequences are inserted
in frame and directly 3' from the initiation codon for the characterizing gene coding
sequences. In another embodiment, the key gene coding sequences are inserted into the
3' untranslated region (UTR) of the characterizing gene and, preferably, have their own
internal ribosome entry sequence (IRES).

The vector (preferably a BAC) comprising the key gene coding sequences and
characterizing gene sequences is then introduced into the genome of a potential founder
animal to generate a line of transgenic animals. Potential founder animals can be
screened for the selective expression of the key gene sequence in the population of cells


characterized by expression of the endogenous characterizing gene. Transgenic animals 
that exhibit appropriate expression (e.g., detectable expression of the key gene product 
having the same expression pattern within the animal as the endogenous characterizing 
gene) are selected as founders for a line of transgenic animals.

One aspect of the invention provides a recombinant non-human animal that is the
product of a process comprising introducing a nucleic acid encoding at least a domain of
one of SEQ ID NO: 8, SEQ ID NO: 15, SEQ ID NO: 18, SEQ ID NO: 25, SEQ ID NO:
26, and SEQ ID NO: 27 into the non-human animal.

5.19. USING CROSS SPECIES DATA TO ASSOCIATE GENES WITH TRAITS
OF INTEREST

Another aspect of the invention provides processes by which cross-species data
(e.g., mouse and human data) are used to associate genes with a trait of interest (e.g.,
obesity). In this aspect of the invention, QTL and genes of interest are identified in a first
species using techniques such as those described in Sections 5.19.1 and 5.19.3, below.
Then, the QTL and genes identified in the first species are used to identify regions of the
genome of a second species, or specific genes in the second species, that contribute to,
cause, or are otherwise associated with a trait of interest using the techniques described in
Section 5.19.2, below.

One aspect of the present invention provides ways to identify genes associated
with a trait using data from multiple species. In this aspect of the invention, causal genes
are identified in a first species using the techniques disclosed in Section 5.19.1, Section
5.19.3, United States Patent Application Serial Number 60/492,682, filed August 5, 2003,
entitled "Computer systems and methods for inferring causality from cellular constituent
abundance data", United States Patent Application Serial Number 60/497,480, filed
August 21, 2003, entitled "Computer systems and methods for inferring causality from
cellular constituent abundance data", United States Patent Application Serial Number
60/400,522, filed August 2, 2002, entitled "Computer systems and methods that use
clinical and expression quantitative trait loci to associate genes with traits", and United
States Patent Application Serial Number 60/460,303, filed April 2, 2003, entitled
"Computer systems and methods that use clinical and expression quantitative trait loci to
associate genes with traits." Typically, this first species is a segregating population such
as a mouse cross. The second species is typically humans or some other outbred
population in which controlled crosses are either impossible to arrange or are


prohibitively expensive. The identity of the causal genes, or the key drivers, for traits 
related to a disease under study in the first population can be used to identify candidate 
genes causal for a related disease under study in a second species in any one of at least
three different approaches. 

In a first approach, the causal genes from the first species are directly mapped to
the corresponding genes (orthologous genes) in the second species using standard
comparative genomics procedures. Such genes can be validated using association-based
testing in a case/control population in the second species. For example, consider the case
in which 30 genes have been identified as causal in a controlled cross of a first species
using the techniques disclosed in Section 5.19.3. The identity of these 30 genes is used to
find the orthologous genes in the second species. Alternatively or additionally, the
identity of the 30 genes is used to find corresponding loci in the second species using a
syntenic map between the two species. Next, each locus or gene identified in the second
species is validated using marker-based association studies in appropriately selected
case/control populations.

For example, consider a candidate gene in the second species orthologous to a
gene in the first species that has been identified as causal for a trait under study using the
techniques of Section 5.19.3. To validate the gene in the second species, a cohort of the
second species is assembled. One portion of the cohort will comprise the cases related to
the trait under study (e.g., those individuals in some population identified as obese) and
the second portion of the cohort population will comprise the controls (e.g., a random
sampling of individuals from the population). Each member of the cohort is genotyped
with respect to markers in the gene. Such genotyping can involve ascertainment of the
allele of known markers in the gene or a separate discovery component for and
ascertainment of new markers in the gene. In one approach, the sequence of the gene
from each member of the cohort population is obtained and used to identify single
nucleotide polymorphisms (SNPs) within the gene. Such SNPs typically have a major
allele and a minor allele. Those SNPs with a minor allele frequency present in at least
two percent (or some other minimum threshold number) of the cohort are used as
markers. Each marker is then tested to see if the marker associates with the trait under
study in the cohort population. In addition, haplotypes can be reconstructed from the
collection of SNP data collected in the cohort under study using standard techniques, and
association testing can then be carried out directly using the haplotypes. Further, the
haplotypes can be used to reduce the number of SNPs that need to be genotyped in




extended cohorts or other studies that involve association testing between SNPs in the
gene region and the disease trait of interest.
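
By way of non-limiting illustration, the marker selection described above (retaining
SNPs whose minor allele frequency reaches a minimum threshold, two percent in the
example given) might be sketched as follows. The sketch assumes biallelic SNPs and a
genotype encoding of two-character allele strings; the function names, SNP identifiers,
and data layout are hypothetical and are not taken from this disclosure.

```python
from collections import Counter

def minor_allele_frequency(genotypes):
    """Return the minor allele frequency for one SNP.

    `genotypes` is a list of two-character strings such as "AG",
    one per cohort member (an assumed, illustrative encoding).
    """
    counts = Counter(allele for g in genotypes for allele in g)
    total = sum(counts.values())
    if len(counts) < 2:
        return 0.0  # monomorphic in this cohort
    return min(counts.values()) / total

def select_markers(snp_genotypes, min_maf=0.02):
    """Keep SNP ids whose minor allele frequency is at least `min_maf`."""
    return [snp for snp, g in snp_genotypes.items()
            if minor_allele_frequency(g) >= min_maf]

# Hypothetical cohort of 100 individuals genotyped at three SNPs.
cohort = {
    "snp_a": ["AA"] * 90 + ["AG"] * 10,   # MAF = 10/200 = 0.05
    "snp_b": ["CC"] * 99 + ["CT"] * 1,    # MAF = 1/200 = 0.005
    "snp_c": ["GG"] * 100,                # monomorphic
}
markers = select_markers(cohort)
```

Under these assumptions only the first SNP would be carried forward as a marker; a
different threshold can be supplied through `min_maf`.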

For instance, a marker will associate with a trait under study when the frequency
of one of the alleles for the marker differs significantly between the cases and controls. A
marker will not associate with a trait under study when there is no frequency difference
for any given allele for the marker between the cases and controls. In preferred
embodiments, the genotyped markers are haplotyped using conventional haplotype
techniques known by those of skill in the art and each such haplotype or representative
SNPs comprising each haplotype are used in an association study to see if the haplotype
or representative SNPs associate with the trait under study.
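
One conventional way to test whether an allele frequency differs between cases and
controls is a Pearson chi-square test on the 2x2 table of allele counts. The following is
a minimal sketch of that test, offered as an assumption about how such a comparison
could be made rather than as the specific test used herein; the counts and the 0.05
significance level are hypothetical.

```python
import math

def allele_chi_square(case_counts, control_counts):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele table.

    Each argument is (count of allele 1, count of allele 2).
    Returns (chi-square statistic, p-value).
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0, 1.0  # no variation, nothing to test
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical allele counts: 200 case chromosomes, 200 control chromosomes.
chi2, p = allele_chi_square(case_counts=(120, 80), control_counts=(90, 110))
associated = p < 0.05
```

In practice a correction for testing many markers would normally be applied before a
marker is declared associated.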

To summarize, genetic techniques that rely on pedigree information and that are
described in Sections 5.19.1 and/or 5.19.3 are used to identify potentially causal genes in
a segregating population (first species) such as that of a mouse cross. Genes orthologous
to these genes are identified in another species (second species) such as humans.
Alternatively, loci that syntenically map to such causal genes are identified. The genes or
loci in the second species are genotyped, haplotypes are constructed and/or individual
markers are used in association studies based upon a cohort assembled with respect to the
trait of interest. Those genes or loci that have one or more markers and/or haplotypes that
associate with the disease are considered validated as causal for the trait under study and
are subjected to further analysis.

In a second approach, quantitative or qualitative genetic studies are used to
identify cQTL or loci for traits in the second species. To be effective, the trait analyzed
in the second species has some nexus with the corresponding trait in the first species. For
example, the trait in the first species could be omental fat pad mass whereas the trait in
the second species could be obesity or visceral fat mass. In this second approach, a gene
in the second species is identified as a causal candidate when both of the following
conditions hold true: (i) the gene is within a one lod score drop of a cQTL or other form
of loci in the second species and (ii) the corresponding gene in the first species is
identified as causal using the techniques disclosed in Section 5.19.3. In such cases, genes
so identified can be validated using the type of association-based method described in the
first approach above.
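
The two conditions of the second approach can be illustrated with the following
sketch. It assumes a lod curve sampled on a grid of map positions and takes the one lod
score drop interval at grid resolution, without interpolation; the gene names, positions,
and lod values are hypothetical.

```python
def one_lod_interval(positions, lod_scores):
    """Return (start, end) of the region around the lod peak in which the
    lod score stays within 1.0 of the peak value (one-lod drop interval)."""
    peak = max(range(len(lod_scores)), key=lod_scores.__getitem__)
    cutoff = lod_scores[peak] - 1.0
    left = peak
    while left > 0 and lod_scores[left - 1] >= cutoff:
        left -= 1
    right = peak
    while right < len(lod_scores) - 1 and lod_scores[right + 1] >= cutoff:
        right += 1
    return positions[left], positions[right]

def causal_candidates(genes, interval, causal_orthologs):
    """Genes inside the interval whose first-species ortholog is causal."""
    start, end = interval
    return [name for name, pos in genes.items()
            if start <= pos <= end and name in causal_orthologs]

positions = [0, 10, 20, 30, 40, 50, 60]          # cM, hypothetical scan grid
lods = [0.5, 1.8, 3.1, 3.9, 3.2, 1.7, 0.4]        # hypothetical lod curve
interval = one_lod_interval(positions, lods)
genes = {"geneA": 25, "geneB": 35, "geneC": 55}   # hypothetical gene positions
hits = causal_candidates(genes, interval, causal_orthologs={"geneA", "geneC"})
```

Here `geneB` lies in the interval but has no causal ortholog, and `geneC` has a causal
ortholog but lies outside the interval, so neither satisfies both conditions.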

In a third approach, an expression-based association analysis is performed with
the second species. Genes that (i) have expression patterns that associate with a disease
or trait under study in the second species and (ii) are orthologous to genes in a segregating




population of a first species that are causal for a related disease or trait in the first species
are selected for further analysis. To perform the association analysis, a cohort of the
second species is assembled as described above for the first approach. Once a cohort has
been assembled, a biopsy is taken from each member of the cohort for a particular tissue
of interest (e.g., adipose tissue). These biopsies are used to obtain cellular constituent
abundance data for each member of the cohort. For example, in the case of obesity, the
biopsy taken from each subject could be subcutaneous fat tissue and the cellular
constituent abundance data could be the expression levels of genes expressed in the
subcutaneous fat tissue. Next, an association analysis is run on all or a portion of the
cellular constituents for which cellular constituent abundance data is available. The goal
of such a study is to identify which of the cellular constituents associate with the trait
under study.

For example, a cellular constituent will associate with the trait if the abundance
level of the cellular constituent is consistently high in the portion of the cohort population
that exhibits the disease and is consistently low in the portion of the cohort that is
randomly selected from the population. Reciprocally, a cellular constituent will associate
with the trait if the abundance level of the cellular constituent is consistently low in the
portion of the cohort that exhibits the disease and is consistently high in the portion of the
cohort randomly selected from the population. A cellular constituent will not associate
with a trait when the abundance level of the cellular constituent has no defined pattern
with respect to the cohort. In other words, a cellular constituent will not associate with a
trait if the cellular constituent does not have a pattern of high (or low) abundance in the
portion of the cohort that has the trait or the portion of the cohort representing a random
sampling from the population. Association analyses are further discussed in Section
5.15.1.
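
As one hedged illustration of how "consistently high" or "consistently low" abundance
might be quantified, the sketch below applies a two-sided permutation test to the
difference in mean abundance between the two portions of the cohort. The choice of a
permutation test, the abundance values, and the function names are assumptions made
for illustration and are not prescribed by this disclosure.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(cases, controls, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in mean abundance."""
    rng = random.Random(seed)
    observed = abs(mean(cases) - mean(controls))
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_cases]) - mean(pooled[n_cases:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an exact zero.
    return (hits + 1) / (n_permutations + 1)

cases = [8.1, 7.9, 8.4, 8.0, 8.3, 7.8]      # hypothetical abundance, trait present
controls = [5.2, 5.0, 5.5, 4.9, 5.1, 5.3]   # hypothetical abundance, random sample
p_separated = permutation_p_value(cases, controls)
p_no_pattern = permutation_p_value([5.0, 6.0, 7.0], [5.0, 6.0, 7.0])
```

A small p-value corresponds to a cellular constituent with a defined pattern across the
cohort; a constituent whose abundance is indistinguishable between the two portions
yields a p-value near one.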

Thus, in the third approach, genes in the outbred second species (e.g., humans) that
are associated with a complex trait are identified. Those genes that (i) have expression
patterns that associate with a disease or trait under study in the second species and (ii) are
orthologs of genes that are identified as causal for a related disease or trait in the first
inbred species (e.g., a mouse population) using the techniques discussed, for example, in
Section 5.19.3 are selected as candidates for further validation. Such validation can
include association-based testing in a case/control cohort population using the association
techniques described in the first approach.


One embodiment in accordance with this aspect of the invention provides a
method for confirming the association of a query QTL or a query gene in the genome of a
second species with a clinical trait T exhibited by the second species. In the method, a
first QTL or a first gene in a first species that is linked to a trait T' is found. Here, trait T'
(exhibited by the first species) is indicative of trait T (exhibited by the second species).
See Section 5.19.1, below. A region of the genome of the first species that comprises the
first QTL or the first gene is mapped to a region of the genome of the second species. A
query QTL or a query gene in the second species that is potentially associated with the
trait T is found. The potential association of the query QTL or the query gene with the
clinical trait T is confirmed when the query QTL or the query gene is in the region of the
genome of the second species. See Section 5.19.2, below.

Sections 5.19.1 and 5.19.2 provide an example in which trait T is obesity and trait
T' is indicative of obesity (trait T). However, the present invention is not limited to the
obesity trait. Many other traits T can be studied using the techniques disclosed in this
section. Such traits T include, but are not limited to, asthma, ataxia telangiectasia, bipolar
disorder, cancer, common late-onset Alzheimer's disease, diabetes, heart disease,
hereditary early-onset Alzheimer's disease, hereditary nonpolyposis colon cancer,
hypertension, infection, maturity-onset diabetes of the young, mellitus, migraine,
nonalcoholic fatty liver, nonalcoholic steatohepatitis, non-insulin-dependent diabetes
mellitus, obesity, polycystic kidney disease, psoriases, schizophrenia, and xeroderma
pigmentosum. Those of skill in the art will recognize the traits T' that are indicative of
such traits T.



5.19.1. IDENTIFYING QTL AND GENES IN A FIRST SPECIES THAT ARE
LINKED TO A TRAIT OF INTEREST

In one embodiment in accordance with this aspect of the invention, a first QTL or
a first gene in a first species is found by identifying two inbred strains of the first species
that exhibit polymorphic behavior with respect to a trait T'. In one example, the trait T is
obesity in the second species (human) and trait T' is percent body fat in inbred strains of
mice (first species). In the example, inbred strains of mice that vary significantly with
respect to trait T' are crossed to obtain an F2, back-cross or other such segregating
population. Other types of inbred populations are possible, including, but not limited to,
backcrosses, F2 intercrosses, Ft populations (formed by randomly mating F1s for t-1
generations), F2:3 design (F2 individuals are genotyped and then selfed), Design III (F2


from two inbred lines are backcrossed to both parental lines). The goal at this stage is to
generate a population where trait T' (percent body fat) is segregating. This process is
depicted graphically in Fig. 55 for EZE responsiveness. More than one trait T' can be
studied at the same time. In one example, trait T is obesity in human and the traits T' are
high density lipoprotein (HDL) level, low density lipoprotein (LDL) level, very low
density lipoprotein (VLDL) level, free fatty acid level, fat pad masses at several depots,
and BMI in mice. BMI is defined in Section 2.3, above. In such cases, traits in mice
(HDL level, etc.) are indicative of the trait T in human (obesity). Those of skill in the art
will appreciate that there are a large number of cases where traits in one species are
indicative of a trait T in another species, and all such cases are within the scope of the
present invention.

Once a segregating population of the first species has been obtained, it is
genotyped based upon, for example, a marker map 78 (Fig. 1). For details on marker
maps 78 see Section 5.2, above. In addition, the segregating population may be scored
with respect to phenotypes associated with the trait of interest. Further, tissues relevant to
the trait T are isolated for expression profiling. The trait data 70, expression data 44 and
genotype data 68 (Fig. 2) are then analyzed using methods such as those described in
Section 5.1 and depicted in Fig. 2 in order to identify genes and patterns of expression
associated with the traits of interest. This process is depicted in Fig. 56.

One use of expression data 44 is to refine the definition of the trait T' (e.g., to
refine the definition of percent body fat in mice). In essence, this refinement of the trait
T' equates to identifying those subgroups within the whole population that are
homogenous with respect to trait T'. Fig. 57 depicts the general case where the
population under study is homogenous with respect to T'. Circle 5700 comprises
different shapes to represent different trait subtypes. The
different subtypes may be phenotypically similar (e.g., all obese), but they can be
stratified based on differences in the underlying mechanisms that lead to the phenotype of
interest (T').

There are many different ways to identify subtypes within a population using
expression data 44. In other words, there are many different ways to stratify a population
into different subpopulations in order to refine the definition of trait T'. Each
subpopulation has a more refined definition of trait T'. To illustrate, consider the case in
which a population exhibits trait T'. In this illustration, cellular constituent expression


data for each organism in the population is used as a discriminator to break the population
down into two or more subpopulations. The first subpopulation exhibits trait T1', the
second subpopulation exhibits trait T2', and so forth. All the subpopulations exhibit trait
T', but now the trait has been refined into trait T1', T2', etc. based on the expression data.

One way to stratify a population in order to better characterize (or
subcharacterize) trait T' is depicted in Figs. 57 and 58. First, those individuals in the
subpopulation that are most extreme with respect to trait T' (subpopulations 5702 and
5704) are identified. For example, when trait T' is fat pad mass, the mice with highest fat
pad mass and lowest fat pad mass in the population are selected as depicted in Fig. 57.
The idea here is to enrich for patterns associated with the different subtypes of trait T' by
focusing on the phenotypic extremes of trait T'. In one embodiment of the present
invention, a phenotypic extreme is defined as the top or lowest 40th, 30th, 20th, or 10th
percentile of the population with respect to a given trait T' exhibited by the population.
Once the phenotypic extremes have been identified, clustering approaches and pattern
recognition techniques described in Section 5.1 and Fig. 2 can be used to identify patterns
of expression that appear to define different subgroups (T1', T2', etc.) within the extremes
that, when considered together, appear to discriminate between the extreme phenotypes.
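
The selection of phenotypic extremes described above might be sketched as follows.
The sketch assumes that trait T' values are held in a mapping from organism identifier
to trait value; the identifiers, values, and function name are hypothetical.

```python
def phenotypic_extremes(trait_values, fraction=0.25):
    """Split organisms into the lowest and highest `fraction` of a trait.

    `trait_values` maps an organism id to its trait T' value.
    Returns (low_ids, high_ids), each ordered by increasing trait value.
    """
    if not 0 < fraction <= 0.5:
        raise ValueError("fraction must be in (0, 0.5]")
    ranked = sorted(trait_values, key=trait_values.get)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]

# Hypothetical fat pad mass values for eight mice.
fat_pad_mass = {"m1": 0.8, "m2": 2.9, "m3": 1.5, "m4": 3.4,
                "m5": 0.6, "m6": 2.1, "m7": 1.1, "m8": 3.0}
low, high = phenotypic_extremes(fat_pad_mass, fraction=0.25)
```

The two returned groups would then be passed to the clustering and pattern
recognition techniques referred to above.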

In one example, the upper and lower 25th percentiles of a segregating F2
population of mice with respect to fat pad mass were examined. The F2 intercross was
constructed from C57BL/6J and DBA/2J strains of mice. Mice were on a rodent chow
diet up to 12 months of age, and then switched to an atherogenic high-fat, high-cholesterol
diet for another four months. Parental and F2 mice were sacrificed at 16 months of age.
At death the livers were immediately removed, flash-frozen in liquid nitrogen and stored
at -80 °C. Total cellular RNA was purified from 25 mg portions using an RNeasy Mini kit
according to the manufacturer's instructions (Qiagen, Valencia, CA). Further details on
this cross are found in Section 6.4, below.

Fig. 59 depicts a two-dimensional cluster of the most differentially expressed set
of genes in mice comprising the upper and lower 25th percentiles of the subcutaneous fat
pad mass (FPM) trait in the segregating F2 population. In Fig. 59, the y-axis represents
the 280 genes in mice that are most differentially expressed in extreme subpopulations of
the mouse population and the x-axis represents the mouse population itself. This set of
genes (the FPM set) can be considered as the most transcriptionally active set of genes for
mice falling in the tails of the FPM trait (T') distribution. The selection of this gene set




for clustering was not biased by selecting genes based on eQTL linkage information, their
ability to discriminate between the FPM trait extremes, nor on their correlation to genes
identified by eQTL and/or trait-discrimination properties. Despite this, when clustering
on this set of genes over the F2 population, the mice cluster almost perfectly into high
FPM and low FPM groups as shown in Fig. 59. In addition, there appear to be two
distinct expression patterns for mice in the high FPM group (high FPM group 1; high
FPM group 2), indicating some degree of heterogeneity in the high FPM mice. This
example shows how a gene set can be used to cluster a population exhibiting a trait T'
into subpopulations, where each subpopulation exhibits a different trait (T1', T2', etc.) that
is defined by a characteristic expression pattern of the gene set.

The patterns realized in Fig. 59 serve to define subcategories of the obesity trait,
FPM. In fact, these patterns refine the definition of FPM into subcategories (T1', T2', etc.)
for the trait in a way that would not be possible without the expression data. There are
clearly two distinct expression patterns associated with high FPM mice depicted in Fig.
59 over the gene set. This heterogeneity of expression patterns associated with the
clinical trait T' almost certainly points to heterogeneity in the complex disease itself.

For the FPM trait, a genome-wide scan revealed 4 cQTL with lod scores greater
than 2.0. Taken together, these cQTL explained slightly less than 50% of the variation in
the FPM trait values. To further elucidate this clinical trait T', the 111 F2 animals for
which clinical and gene expression data existed were classified into one of the three
groups depicted in Fig. 59 (high FPM group 1, high FPM group 2, low FPM group).

Subsequently, separate genetic analyses were performed. Figs. 60 and 61
respectively depict these analyses for chromosome 2 and chromosome 19 of the mouse
genome. Experimental details for this QTL analysis are provided in Section 6.4, below.
For each chromosome (2 and 19) three separate analyses were run. The first analysis
used the entire set of 111 F2 animals (curve 6002). The second analysis used the set of F2
animals that comprise the high FPM group 1 and low FPM animals (curve 6006). The
third analysis used the set of F2 animals that comprise the high FPM group 2 and low
FPM animals (curve 6004). Fig. 60 shows that high FPM group 1 is not under the control
of the chromosome 2 QTL, but that high FPM group 2 is under the control of the
chromosome 2 QTL. Fig. 61 shows that the high fat animal group that is not under the
control of the chromosome 2 locus (high FPM group 1) is under the control of a
chromosome 19 locus, while high FPM group 2 is not. These results indicate that
chromosome 2 and 19




QTL each significantly affect only a subset of the F2 population, a form of heterogeneity
that speaks directly to the complexity underlying traits such as obesity. The chromosome
19 QTL explains 19% of the variation in the FPM trait for the high FPM group 1 / low
FPM subset, but would have been completely missed if the expression data had not been
used to define subphenotypes.

The significance of the QTL with the highest lod scores depicted in Figs. 60 and
61 was assessed by repeatedly sampling (10,000 times) from the full set of F2 animals so
that groups equal in size to the high FPM group 1 / low FPM and high FPM group 2 / low
FPM groups were obtained for each iteration. None of the 10,000 samplings achieved
QTL matching the significances of those given in Figs. 60 and 61.
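
The resampling procedure just described can be sketched in general form as follows.
The statistic actually used above is the significance of a QTL; because the QTL scan
itself is outside the scope of this sketch, a toy single-marker effect size stands in for
it. That stand-in, the simulated population, and all names are assumptions made for
illustration only.

```python
import random

def resampling_p_value(all_animals, subset_size, statistic, observed,
                       n_samples=10_000, seed=0):
    """Fraction of random equal-sized subsets whose statistic reaches `observed`.

    `statistic` is any callable that scores a list of animals, for example
    the peak lod score from a QTL scan (supplied by the caller).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        subset = rng.sample(all_animals, subset_size)
        if statistic(subset) >= observed:
            hits += 1
    return hits / n_samples

# Toy stand-in for a QTL scan: absolute difference in mean trait value
# between the two genotype classes at a single hypothetical marker.
def genotype_effect(animals):
    a = [trait for genotype, trait in animals if genotype == "A"]
    b = [trait for genotype, trait in animals if genotype == "B"]
    if not a or not b:
        return 0.0
    return abs(sum(a) / len(a) - sum(b) / len(b))

rng = random.Random(1)
population = [(rng.choice("AB"), rng.gauss(0.0, 1.0)) for _ in range(111)]
p_impossible = resampling_p_value(population, 40, genotype_effect,
                                  observed=float("inf"), n_samples=1000)
p_trivial = resampling_p_value(population, 40, genotype_effect,
                               observed=0.0, n_samples=1000)
```

A returned value of zero corresponds to the situation reported above, in which none of
the random samplings reached the observed significance.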

To summarize, one embodiment in accordance with the present invention
identifies a first QTL or a first gene in a first species that is linked to a trait T' by crossing
a first strain and a second strain of the first species in order to obtain a segregating
population. Next, the population is stratified into a plurality of subpopulations. At least
one subpopulation in the plurality of subpopulations represents a phenotypic extreme of
trait T'.

Cellular constituent measurements from organisms in the plurality of
subpopulations are used to identify a cellular constituent set that exhibits a cellular
constituent measurement pattern associated with the phenotypic extreme. Then the
segregating population is clustered based on measurements of the cellular constituent set
in organisms in the segregating population to obtain a plurality of population clusters
(e.g., the high FPM group 1, high FPM group 2, and low FPM group of Fig. 59).
Quantitative genetic analysis is performed on these population clusters in order to find a
first QTL or a first gene that is linked to trait T'. In some embodiments, the quantitative
genetic analysis is performed using a method that uses one or more techniques selected
from the group consisting of linkage analysis (Section 5.13), quantitative genetic analysis
that uses a plurality of cellular constituent measurements as a phenotypic trait (Section
5.1), and association analysis (Section 5.14). In some embodiments, this linkage arises
when the first QTL has a lod score greater than 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, or 8.0. In
some embodiments, the cellular constituent measurements are transcriptional state
measurements or translational state measurements. In some embodiments, the cellular
constituent measurements are translational state measurements that are performed using
an antibody array or two-dimensional gel electrophoresis. In some embodiments, the




cellular constituent set comprises a plurality of metabolites and the plurality of cellular
constituent measurements are derived by a cellular phenotypic technique (e.g., a
metabolomic technique in which a plurality of levels of metabolites in one or more
organisms in the segregating population is measured). In some embodiments, the
metabolites comprise an amino acid, a metal, a soluble sugar, or a complex carbohydrate.
In some embodiments, the cellular constituent measurements comprise gene expression
levels, abundance of mRNA, protein expression levels, or metabolite levels.


In another embodiment in accordance with the present invention, a first QTL or a
first gene in a first species that is linked to a trait T' is found by crossing a first strain and
a second strain of the first species in order to obtain a segregating population. Next, the
segregating population is divided into a plurality of subpopulations using a classification
scheme that classifies each organism in the segregating population into at least one of the
subpopulations. The subpopulations may or may not represent phenotypic extremes. The
classification scheme uses cellular constituent measurements of a plurality of cellular
constituents from each organism in the segregating population. Various classification
schemes that may be used to perform this step are found in copending United States
Patent Application No. 60/382,036, filed May 20, 2002, entitled "Computer systems and
methods for subdividing a complex disease into component diseases", which is hereby
incorporated by reference in its entirety. For at least one subpopulation in the plurality of
subpopulations, quantitative genetic analysis on the subpopulation is performed in order
to find the first QTL or the first gene. In some embodiments, the cellular constituent
measurements are transcriptional state measurements or translational state measurements.
In some embodiments, the cellular constituent measurements are translational state
measurements that are performed using an antibody array or two-dimensional gel
electrophoresis. In some embodiments, the plurality of cellular constituent measurements
comprise a plurality of metabolites and the plurality of cellular constituent measurements
are derived by a cellular phenotypic technique such as a metabolomic technique in which
a plurality of levels of metabolites in each of the organisms is measured. Representative
metabolites comprise amino acids, metals, soluble sugars, and complex carbohydrates. In
some embodiments, the cellular constituent comprises gene expression levels, abundance
of mRNA, protein expression levels, or metabolite levels. In some embodiments, the
quantitative genetic analysis is performed using a method that uses one or more
techniques selected from the group consisting of linkage analysis (Section 5.13),


quantitative genetic analysis that uses a plurality of cellular constituent measurements as a 
phenotypic trait (Section 5.1), and association analysis (Section 5.14).

The experiments summarized by Figs. 59 through 61 focus on the use of mouse
crosses to elucidate complex traits. However, the present invention provides alternative
methods for elucidating complex traits. For instance, instead of scoring phenotypes and
expression profiling F2 animals, a set of congenic mice that span the entire genome could
be profiled. Fig. 62 gives an example of a congenic strain. The congenic strain is
constructed from two inbred strains, B6 and CAST, where B6 serves as the background
strain and CAST serves as the donor strain. The construction of the congenics results in a
segment of one chromosome from the donor strain (CAST in Fig. 62) becoming
introgressed onto the genome of the background strain (B6 in Fig. 62). For example, in
Fig. 62, region 6202 of chromosome 6 from CAST becomes introgressed with
chromosome 6 of B6, yielding congenic strain B6.CAST Chr.6. Whole sets of such
congenics can be constructed such that each strain in the set covers a part of some
chromosome, with the set taken as a whole covering the entire genome of the donor
strain.
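
Whether a set of congenic strains, taken as a whole, covers an entire chromosome can
be checked with a simple interval sweep, sketched below. The representation of each
donor segment as a (start, end) pair, the units, and the example coordinates are
hypothetical and serve only to illustrate the coverage requirement stated above.

```python
def uncovered_regions(chrom_length, donor_segments):
    """Return the gaps on one chromosome not covered by any congenic strain.

    `donor_segments` is a list of (start, end) donor intervals, one per
    congenic strain, in the same units as `chrom_length`.
    """
    gaps = []
    covered_to = 0.0
    for start, end in sorted(donor_segments):
        if start > covered_to:
            gaps.append((covered_to, start))
        covered_to = max(covered_to, end)
    if covered_to < chrom_length:
        gaps.append((covered_to, chrom_length))
    return gaps

# Hypothetical donor segments (in cM) carried by three congenic strains.
segments = [(0.0, 30.0), (25.0, 55.0), (60.0, 75.0)]
gaps = uncovered_regions(75.0, segments)
```

An empty result indicates that the strains jointly cover the chromosome; repeating the
check for every chromosome addresses the genome as a whole.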

The advantage of the congenic sets is that they can be used to screen all mice
making up the set for the trait of interest and to identify those strains that exhibit the trait
of interest, compared to the background strain. For instance, in studying obesity, the
congenic sets can be used to identify those congenic strains that are significantly heavier
than the background strain. Once such strains are identified, a large amount of work
identifying the genes underlying the trait of interest has already been accomplished
because the causal gene will reside in the congenic region.

^ 

Congenics are also useful once QTL have been identified in an F2 population
constructed from the same inbred strains making up the congenic set. For instance, once
a QTL is identified for the trait of interest, the strain whose congenic region covers the
QTL region can be identified and studied with respect to the same phenotype. Further,
more complicated genetic models can be constructed using the congenics, based on QTL
results from, for example, an F2 cross. For example, suppose that, from the F2 cross, two
strongly interacting QTL are identified. The congenic strains covering the two
QTL regions could be identified and bred to construct a new congenic strain that had two
congenic regions, each covering one of the QTL of interest. These mice could then be
studied with respect to the phenotype of interest. The advantage of this sort of
construction is that the congenic strains are stable and can be constantly bred to generate
progeny that are genetically identical (unlike the F2 populations, where there is no hope
of recovering the same genetic background).
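
The lookup described above, in which the congenic strain whose donor region covers a QTL is identified, can be sketched in Python as follows. This is only an illustrative sketch: the strain labels, the centiMorgan coordinates, and the helper `strains_covering` are hypothetical and do not come from the specification.

```python
from typing import NamedTuple

class CongenicStrain(NamedTuple):
    name: str        # hypothetical strain label
    chrom: str       # chromosome carrying the donor segment
    start_cm: float  # start of the donor segment (centiMorgans)
    end_cm: float    # end of the donor segment (centiMorgans)

def strains_covering(strains, chrom, qtl_start_cm, qtl_end_cm):
    """Return the strains whose donor segment fully contains the QTL interval."""
    return [s for s in strains
            if s.chrom == chrom
            and s.start_cm <= qtl_start_cm
            and qtl_end_cm <= s.end_cm]

# Illustrative congenic set; all coordinates are made up.
congenic_set = [
    CongenicStrain("B6.CAST-6a", "6", 0.0, 30.0),
    CongenicStrain("B6.CAST-6b", "6", 25.0, 60.0),
    CongenicStrain("B6.CAST-13a", "13", 10.0, 45.0),
]

# Two interacting QTL from a hypothetical F2 cross: (chromosome, start, end).
qtl_a = ("6", 35.0, 42.0)
qtl_b = ("13", 20.0, 28.0)

# Candidate strains to breed together into a double congenic.
to_breed = [strains_covering(congenic_set, *q) for q in (qtl_a, qtl_b)]
```

Requiring full containment of the QTL interval is a design choice of this sketch; a partial-overlap criterion could be substituted where the QTL support interval is wide.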

Congenics are useful in studying traits first studied in an F2 population. The trait
itself may vary considerably over the F2 population, but once QTL are identified in the
F2 that lead to a particular trait value (e.g., low fat), the congenic corresponding to the
QTL can be identified and scored for the same trait. In many cases, the congenic will
exhibit the same trait values as those F2 mice under the control of the associated QTL.

Various techniques have been disclosed for identifying QTL and/or genes in a first
species that are linked to one or more traits T'. Additional techniques for identifying such
QTL and/or genes are found in copending United States Patent Applications 60/382,036,
filed May 20, 2002, entitled "Computer systems and methods for subdividing a complex
disease into component diseases"; 60/381,437, filed May 16, 2002, entitled "Computer
system and method for identifying genes and determining pathways associated with
traits"; and 60/400,522, filed August 2, 2002, entitled "Computer systems and methods
that use clinical and expression quantitative trait loci to associate genes with traits", which
are hereby incorporated by reference in their entireties.

5.19.2. IDENTIFYING REGIONS OF THE GENOME OF A SECOND SPECIES
THAT ARE LINKED TO A TRAIT OF INTEREST

In the present invention, once the genes and/or QTL in a first species that are
linked to trait T' (e.g., the mouse models described in Section 5.19.1) have been
identified, those QTL and genes of interest are mapped to the genome of a second species
(e.g., humans). In one embodiment, this mapping happens through the construction of
syntenic maps between the first species and the second species (e.g., between mouse and
human). The syntenic map is constructed by mapping conserved regions between the two
species such as EST, mRNA, conserved STS markers, conserved regulatory regions, etc.

Traits T in populations of the second species that are similar to traits T' studied in
the first species are identified. Further, regions in the genome of the second species that
are associated with the identified traits T are determined. Fig. 63 shows an example in
which the first species is human and the second species is mouse. In Fig. 63, two
hypothetical QTL (6302 and 6304) that are linked to a human obesity-related risk trait are
on a portion of human chromosome 8. The human genome region that includes
hypothetical QTL 6302 and 6304 is mapped to the mouse genome using a syntenic map
between mouse and human. It will be appreciated that genes are not mapped at this point.
Rather, whole regions are mapped. In some embodiments, the region that is mapped is a
portion of a chromosome, a region that is less than 100 centiMorgans, less than 10
centiMorgans, or less than 5 centiMorgans. Using syntenic mapping, it is determined that
hypothetical QTL 6302 and 6304 map to a portion of mouse chromosome 13.
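
By way of illustration only, mapping a whole region (rather than individual genes) through a syntenic map can be sketched in Python as below. The block coordinates, the `SyntenyBlock` type, and the linear interpolation inside a block are assumptions made for this sketch; the specification does not prescribe a data format or an interpolation rule.

```python
from typing import NamedTuple

class SyntenyBlock(NamedTuple):
    src_chrom: str    # chromosome in the source species
    src_start: float
    src_end: float
    dst_chrom: str    # chromosome in the target species
    dst_start: float
    dst_end: float

def map_region(blocks, chrom, start, end):
    """Map a source region to the target genome via the first block containing it.

    Positions are interpolated linearly inside the block, which is a
    simplifying assumption. Returns None when no single block contains
    the region.
    """
    for b in blocks:
        if b.src_chrom == chrom and b.src_start <= start and end <= b.src_end:
            scale = (b.dst_end - b.dst_start) / (b.src_end - b.src_start)
            return (b.dst_chrom,
                    b.dst_start + (start - b.src_start) * scale,
                    b.dst_start + (end - b.src_start) * scale)
    return None

# Hypothetical block: part of human chromosome 8 syntenic with mouse chromosome 13.
synteny = [SyntenyBlock("8", 100.0, 140.0, "13", 20.0, 40.0)]

# Hypothetical human region containing both QTL.
mouse_region = map_region(synteny, "8", 110.0, 120.0)
```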

Next, the QTL data from the first species is combined with that from the second
species. For example, Fig. 64 lists four lod score curves for obesity-related traits in
mouse (HDL levels, leptin levels, insulin levels and fat pad masses). The QTL for all of
these mouse traits are overlapping. However, in Fig. 65, in which hypothetical human
QTL 6302 and 6304 (Fig. 63) and the mouse QTL given in Fig. 64 are overlaid, the
peaks of the hypothetical human QTL (6302 and 6304) are aligned with the peaks of two of
the mouse QTL depicted in Fig. 63. Thus Fig. 65 describes a relationship in which two
fairly closely linked QTL in humans for an obesity-related trait overlap with two closely
linked QTL in mice for obesity-related phenotypes. Further, as depicted in Fig. 66, this
chromosome 13 region in mouse is a hotspot of activity for eQTL linkage as well, with
hundreds of eQTL linking to one of the two peaks, but usually not both. This data
supports the notion that there are actually two genes here underlying the mouse and
human QTL.

In some embodiments, traits T are identified in the second species using any of
the techniques described in Section 5.19.1, above, or in copending United States Patent
Applications 60/382,036, filed May 20, 2002, entitled "Computer systems and methods
for subdividing a complex disease into component diseases"; 60/381,437, filed May 16,
2002, entitled "Computer system and method for identifying genes and determining
pathways associated with traits"; and 60/400,522, filed August 2, 2002, entitled
"Computer systems and methods that use clinical and expression quantitative trait loci to
associate genes with traits", which are hereby incorporated by reference in their entireties.
However, in a typical embodiment, the second species is human. As such, QTL or gene
identification techniques that make use of congenic strains or planned crosses cannot be
used. Therefore, in some embodiments, association analysis techniques such as those
described in Section 5.14, above, are used.




The information revealed using the techniques described in this section is
extremely valuable in deciding what regions to pursue in the second species (e.g.,
humans). The overlaps can be used to refine the interval that will be fine-mapped in
humans to identify the underlying gene. This results in narrowing the region that must be
pursued, thereby decreasing the amount of time needed to map an interval. Further,
methods described in this application can be used to directly identify genes in mice in
QTL intervals of interest, and directly validate those using association methods in human
populations (Fig. 67). This is a process that can short-circuit the fine mapping approach
and accelerate the process of identifying genes for complex human diseases.

5.19.3. IDENTIFYING QTL AND GENES IN A FIRST SPECIES USING
CAUSALITY

The starting point for the traditional forward genetics approach to dissecting
complex traits, including common human diseases, is identification of QTL controlling
for a disease trait of interest. For more information on complex traits, see Section 5.15.
Genome-wide scans are performed to identify markers spaced along the length of the
genome that are correlated with the disease trait under study. The end result of such a
screen is a number of cQTL identified for the disease trait. This is graphically depicted in
Fig. 74. In particular, Fig. 74 illustrates a hypothetical disease-specific genetic network
for disease traits and related co-morbidities. The quantitative trait loci (Ln) and
environmental effects (panel 7402) represent the most upstream drivers of the disease
traits in a given population. In other words, a quantitative disease trait in a segregating
population can be described as being made up of genetic and environmental components,
with or without interactions among the genetic components and/or between the genetic
and environmental components. As depicted in Fig. 74, the QTL and environmental
effects (7402) influence other "causative" mRNAs (Cn) (panel 7404) singly or in
pathways that can interact in complicated ways (most generally, as a genetic network),
but that ultimately lead to the disease state (primary clinical traits). A genetic network
can be represented as an acyclic directed graph having nodes and edges, where the nodes
represent genes and each respective edge represents confidence that the two nodes
connected by the respective edge are related, as determined by an analysis of genotypic
and gene expression data using the methods of the present invention. Variations in the
causal mRNAs or in the primary clinical traits can in turn affect reactive mRNAs (Rn)
(panel 7406) in other pathways that in turn lead to co-morbidities of the disease trait, or
they can provide positive/negative feedback control to the causal pathways. Instead of
restricting the search for disease-causing genes to the QTL regions associated with the
complex trait, the classic approach in mouse and human genetics, the present invention
broadens the search to any of the cellular constituents that operate in the causal portion of
the genetic network associated with the disease trait (circles 7404). Cellular
constituents in pathways that are under the control of the same QTL that are controlling
for the disease trait, where the cellular constituents can be shown to act as transmitters of
information from these multiple QTL to the disease trait itself (as opposed to acting as
responders to the disease trait), potentially represent key intervention points that can be
targeted to modulate the disease trait.
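
The acyclic directed graph representation mentioned above, with a confidence value attached to each edge, can be sketched in Python as follows. The node names, the confidence values, and the helper functions are hypothetical; QTL and trait nodes are included alongside gene nodes purely for illustration, and the standard-library `graphlib` module (Python 3.9 or later) is used only to check acyclicity.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical network: (upstream node, downstream node) -> edge confidence.
network = {
    ("Q1", "geneA"): 0.9,
    ("geneA", "geneB"): 0.7,
    ("geneB", "trait"): 0.8,
    ("trait", "geneR"): 0.6,   # geneR responds to the trait (reactive)
}

def is_acyclic(edges):
    """Return True when the directed edges contain no cycle."""
    preds = {}
    for src, dst in edges:
        preds.setdefault(dst, set()).add(src)
        preds.setdefault(src, set())
    try:
        tuple(TopologicalSorter(preds).static_order())
        return True
    except CycleError:
        return False

def upstream_of(edges, node):
    """Return every node that has a directed path into `node`."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for src, dst in edges:
            if dst == current and src not in found:
                found.add(src)
                stack.append(src)
    return found

# Nodes upstream of the trait are candidates for the causal portion of the network.
causal_candidates = upstream_of(network, "trait")
```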

In the absence of cellular constituent abundance data or other molecular
phenotyping data on the population under study, the biological/biochemical processes that
take place that ultimately lead to the disease state, starting from the most upstream
genetic components of the disease detected as QTL, are completely hidden from view.
Therefore, as depicted in Fig. 74, those pathways (cellular constituents 7404) that are
impacted by the DNA variations underlying the QTL and that ultimately lead to the
disease state (causal), in addition to those pathways that are impacted as a result of the
system being in the disease state (reactive cellular constituents 7406), are not available for
study.

The generation of large-scale gene expression data on the relevant populations can
significantly expose the many pathways and complicated interactions among cellular
constituents associated with disease, as detailed by Schadt et al., 2003, Nature 422, 297.
The complex networks of interactions that are causal for the disease (7404), as well as
those that are reactive to it (7406), make up the patterns of expression that are associated
with a disease trait. Several examples of this have been provided in the recent literature.
See, for example, Schadt et al., 2003, Nature 422, 297; van de Vijver et al., 2002, N.
Engl. J. Med. 347; van't Veer et al., 2002, Nature 415, 530.

Gene expression traits and disease traits can be modulated by the same QTL.
Therefore, performing genome-wide scans to map eQTL for the gene expression traits
allows one to assess the amount of correlation between the gene expression and disease
traits that is due to common genetic effects. The QTL provide anchors in the complex
network of interactions that lead to disease, and it is this causal information that provides
the opportunity to identify cellular constituents 7404 that transmit "information" from
single or multiple disease QTL to the disease trait itself. Because the QTL can modulate
the disease trait through intermediates, identifying the intermediates using the
combination of genetics and gene expression data (or other cellular constituent abundance
data) has the potential to elucidate key control points in the complex network associated
with the disease.

Since one of the primary aims of the target discovery process is to identify targets
for therapeutic intervention in complex human diseases, it is advantageous to partition
cellular constituents (e.g., genes) making up the patterns of expression associated with the
disease trait and that are modulated by QTL overlapping the disease trait QTL, into two
groups: 1) cellular constituents under the control of the disease QTL that fall between the
causal and reactive boundaries depicted in Fig. 74 (cellular constituents 7404), and 2)
cellular constituents that appear to be reactive to the disease state (cellular constituents
7406). Once cellular constituents have been partitioned into causal set 7404 and reactive
set 7406, attention can shift to those cellular constituents in causative set 7404 to identify
key targets for the disease.

Approaching the dissection of complicated genetic networks associated with
disease from this partitioning standpoint greatly simplifies the more general problem of
reconstructing whole genetic networks. The reconstruction of genetic networks has been
vigorously pursued in many settings and has met with some success in microbial
organisms. See, for example, Marcotte, 1999, Science 285, 751; and Lee et al., 2002,
Science 298, p. 799. The genetic network reconstruction problem is not yet tractable for
mammalian systems, mainly due to the complexity and extent of data that would be
required to undertake such a reconstruction. See, for example, van Someren et al., 2002,
Pharmacogenomics 3, 507. Reducing the genetic network problem to one of partitioning
sets of cellular constituents should make the problem tractable and directly relevant to the
identification of targets for complex human diseases.

The partitioning approach requires that a basic set of causal scenarios be tested to
determine whether a cellular constituent under the control of disease QTL is causal for the
disease or reactive to it. For each cellular constituent under consideration, first a
determination is made as to whether changes in the abundance (e.g., expression) of the
cellular constituent are associated with QTL that explain variations in the disease trait.
Then a determination is made as to whether the QTL act on the disease trait through the
gene.

Fig. 75A presents the possible relationships between QTL, cellular constituents
and disease traits once the abundance of a cellular constituent (e.g., gene G) and the
disease trait (T) have been shown to be under control of a common QTL (Q). Pathway
7502 represents the simplest causal relationship of a single QTL, Q, for the quantitative
trait T, where Q acts on T through cellular constituent G. Pathway 7504 represents the
simplest reactive diagram for a single QTL, Q, for the quantitative trait T, where in this
case the abundance of cellular constituent G is responding to T. In pathway 7506, the
QTL, Q, is causative for the trait T and the abundance of cellular constituent G, but acts
on these traits independently. Pathway 7508 represents a more complicated causal
diagram where QTL Q affects the abundance of cellular constituents, and these cellular
constituents, in turn, act on the trait T. Pathway 7510 represents the ideal causal diagram
for target identification, where a number of QTL explain a significant amount of the
variation in the trait T, but all of these QTL act on T through a single cellular constituent
G.

To illustrate how partitioning genes into causal and reactive classes can be
accomplished given gene expression data from a segregating population, consider a
hypothetical mouse population in which half of the mice have the AA genotype and the
other half have the BB genotype at a given locus L. As depicted in Fig. 75B, all mice with
the BB genotype are obese, while 87.5% of the mice with the AA genotype are lean and
the other 12.5% are obese. Further, 87.5% of the BB mice have higher transcript levels of
a specific gene, while the other 12.5% have unchanged levels, and similarly, 87.5% of the
AA mice have lower transcript levels of the same gene, while the other 12.5% have
unchanged levels. If the clinical and expression traits were uncorrelated with the genotype
at locus L (i.e., not significantly linked to this locus), an equal percentage would be
expected for each of the expression/clinical trait combinations for each genotype at
locus L. Since this is clearly not true in Fig. 75B, the expression and clinical traits are
significantly linked to locus L.

To determine in this case if the mRNA is a cause or consequence of the clinical
state, the data are fit to the three competing models. Fig. 75C highlights the Causative
model, where the correlation between genotype and clinical trait predicted from the
model is seen to be consistent with the observed correlation. In one embodiment
described below, this scenario will translate into a situation where the correlation between
the clinical trait and genotype, given the gene expression state, is seen to be 0. Because
the clinical trait and genotype are uncorrelated once we condition on transcript
abundances, we can tentatively conclude the mRNA is causal for the clinical trait. Fig.
75D highlights the Reactive model, where the observed correlation between the gene
expression trait and genotype is 0.88, but now the correlation between the gene
expression trait and genotype given any of the clinical trait values is not equal to 0, i.e.,
the correlation between the expression trait and genotype predicted from the model does
not equal the observed correlation. Because the expression trait and genotypes are still
significantly correlated after conditioning on the clinical trait values, it is possible to
confirm that the mRNA levels are not responding to the clinical trait. Finally, Fig. 75E
highlights the Independent model, where again the correlation between the gene
expression and clinical traits predicted from the model is not consistent with the observed
correlation. Therefore, given the results of the fits to these three models, the data for this
hypothetical example indicate that the Causative model is the most parsimonious and thus
is the best explanation of the underlying biology. It is concluded that the AA/BB locus
controls variation in the mRNA levels and that this mRNA, in turn, controls variation in
the clinical trait, rather than the mRNA levels changing as a consequence of the obesity.
By applying a statistically rigorous version of this causality testing to the whole genome
(described below), the genes controlling variation in mRNA levels that in turn control
clinical traits can be identified. In another embodiment, likelihoods are created for each
of the possible models (independent, causative, and reactive) based on relationships
depicted in each model and then maximized with respect to model parameters. In this
other embodiment, the causative model gives rise to the largest likelihood.
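
The conditioning argument above can be illustrated with a small Python sketch. It is not the statistical procedure of the invention: linear partial correlation is swapped in here as a simple stand-in for "correlation given the third variable", and the data are simulated under a Causative chain (genotype to expression to trait) with arbitrary, assumed noise levels. All variable and function names are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation of x and y after linearly conditioning on z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Simulate the Causative model: genotype -> expression -> clinical trait.
rng = random.Random(0)
n = 5000
genotype = [rng.randint(0, 1) for _ in range(n)]         # 0 = AA, 1 = BB
expression = [q + rng.gauss(0, 0.5) for q in genotype]   # driven by genotype
trait = [g + rng.gauss(0, 0.5) for g in expression]      # driven by expression

# Causative signature: trait and genotype decouple once expression is known.
r_tq_given_g = partial_corr(trait, genotype, expression)
# Reactive test: expression and genotype stay correlated given the trait.
r_gq_given_t = partial_corr(expression, genotype, trait)
```

Under these assumptions the first quantity should be close to zero and the second clearly nonzero, which is the pattern the text associates with the Causative model and against the Reactive model.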

The models in Fig. 75A are the ideal, simplest cases. In reality there will usually
be a number of loci and mRNAs that cause disease, related by a complex network of
interactions, as depicted in Fig. 74. In the approach detailed below, this complexity in a
segregating population can be harnessed to identify specific genes that transmit
information from the disease trait QTL to the clinical disease trait itself. Specifically, a
disease trait QTL will modulate the disease trait through intermediates. Identifying the
intermediates using the combination of genetics and gene expression data has the
potential to elucidate key control points in the complex network associated with the
disease.

Figs. 79A and 79B illustrate the processing steps that are performed in accordance
with one embodiment of the present invention. These figures will be referenced in this
section in order to disclose the advantages and features of the present invention.

Step 7902. The present invention begins with the step of obtaining genotype data
68. Genotype data 68 comprises the actual alleles for each genetic marker typed in each
individual in a plurality of individuals under study. In some embodiments, the plurality
of individuals under study is human. Genotype data 68 includes marker data at intervals
across the genome under study or in gene regions of interest. In some embodiments, such
data is used to monitor segregation or detect associations in a population of interest.
Marker data comprises those markers that will be used in the population under study to
assess genotypes. In one embodiment, marker data comprises the names of the markers,
the type of markers, and the physical and genetic location of the markers in the genomic
sequence. Exemplary types of markers include, but are not limited to, restriction
fragment length polymorphisms "RFLPs", random amplified polymorphic DNA
"RAPDs", amplified fragment length polymorphisms "AFLPs", simple sequence repeats
"SSRs", single nucleotide polymorphisms "SNPs", microsatellites, etc. Further, in some
embodiments, marker data comprises the different alleles associated with each marker.
For example, a particular microsatellite marker consisting of 'CA' repeats can represent
ten different alleles in the population under study, with each of the ten different alleles, in
turn, consisting of some number of repeats. Representative marker data in accordance
with one embodiment of the present invention is found in Section 5.2. In one
embodiment of the present invention, the genetic markers used comprise single nucleotide
polymorphisms (SNPs), microsatellite markers, restriction fragment length
polymorphisms, short tandem repeats, DNA methylation markers, sequence length
polymorphisms, random amplified polymorphic DNA, amplified fragment length
polymorphisms, or simple sequence repeats.
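
One possible in-memory layout for the marker data and genotype data described in step 7902 is sketched below in Python. The class names, field names, marker name, and coordinates are all hypothetical; the specification lists the kinds of information held (marker name, type, physical and genetic location, alleles, and per-individual alleles) but does not define a data structure.

```python
from dataclasses import dataclass, field

@dataclass
class Marker:
    name: str
    marker_type: str       # e.g. "SNP", "RFLP", "microsatellite"
    physical_pos_bp: int   # physical location in the genomic sequence
    genetic_pos_cm: float  # genetic location in centiMorgans
    alleles: list = field(default_factory=list)  # alleles seen in the population

@dataclass
class GenotypeData:
    markers: dict = field(default_factory=dict)  # marker name -> Marker
    calls: dict = field(default_factory=dict)    # individual -> {marker name -> allele pair}

    def add_call(self, individual, marker_name, allele_pair):
        """Record the two alleles typed for one individual at one marker."""
        marker = self.markers[marker_name]
        for allele in allele_pair:
            if allele not in marker.alleles:
                raise ValueError(f"unknown allele {allele!r} for marker {marker_name}")
        self.calls.setdefault(individual, {})[marker_name] = tuple(allele_pair)

# Hypothetical microsatellite of 'CA' repeats; alleles are repeat counts.
genotype_data = GenotypeData()
genotype_data.markers["M1"] = Marker("M1", "microsatellite", 1_250_000, 3.2,
                                     alleles=[12, 14, 15])
genotype_data.add_call("mouse_001", "M1", (12, 15))
```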

In some embodiments, step 7902 uses pedigree data. Pedigree data comprises the
relationships between individuals in the population under study. The extent of the
relationships between the individuals under study can be as simple as an inbred F2
population, an F1 population, an F2:3 population, a Design III population, or as complicated
as extended human family pedigrees. Exemplary sources of genotype and pedigree data
are described in Section 5.2.

In some embodiments, a genetic map is generated from genotype data and
pedigree data. Such a genetic map includes the genetic distance between each of the
markers present in the genotype data. These genetic distances are computed using
pedigree data. In some embodiments, the plurality of organisms under study represents a
segregating population and pedigree data is used to construct the marker map. As such,
in one embodiment of the present invention, genotype probability distributions for the
individuals under study are computed. Genotype probability distributions take into
account information such as marker information of parents, known genetic distances
between markers, and estimated genetic distances between the markers. Computation of
genotype probability distributions generally requires pedigree data. In some embodiments
of the present invention, pedigree data is not provided and genotype probability
distributions are not computed. In some embodiments, a genetic map is not computed.



Using populations derived from multiple founders 

In some embodiments, the population that is used for the methods illustrated in
Fig. 79 is a population that is derived from a select set of strains (e.g., a small, but diverse
number of founding mice) or individuals (e.g., the Icelandic population, which was
founded by a small number of individuals). In some embodiments, between 2 and 100,
between 5 and 500, more than five, or less than 1000 strains of a species diverse with
respect to complex phenotypes associated with common human disease are chosen. In
some embodiments, the species is mice. In some embodiments, between 2 and 10 (e.g.,
6) strains of mice that are diverse with respect to complex phenotypes associated with
common human disease are selected. Representative common human diseases include,
but are not limited to, obesity, diabetes, atherosclerosis and associated morbidities,
metabolic syndrome, depression / anxiety, osteoporosis, bone development, asthma, and
chronic obstructive pulmonary disease. The actual number of founding strains is not as
important a factor as ensuring that these "founders" are diverse so as to introduce
extensive heterogeneity into the population. In one representative embodiment, the
species under study is mice and all or a portion of the following strains are used:
B6_DBA GTMs (Jake Lusis, University of California, Los Angeles), B6_CAST GTMs
(Jake Lusis, University of California, Los Angeles), B6_DBA Consomics (Joe Nadeau,
Case Western Reserve University), AXB recombinant inbred (RI) lines (JAX, Bar Harbor,
Maine), BXA RI lines (JAX), LXS RI lines (Rob Williams, University of Tennessee),
AKXD RI lines (JAX), 8-way cross mice (Rob Hitzmann, Oregon Health and Science
University), 129S1/SvImJ (JAX), A/J (JAX), C57BL/6J (JAX), BALB/cJ (JAX),
C3H/HeJ (JAX), CAST/Ei (JAX), DBA/2J (JAX), NOD/LtJ (JAX), NZB/BlNJ (JAX),
SJL/J (JAX), AKR/J (JAX), CBA/J (JAX), FVB/NJ (JAX), and SWR/J (JAX).

In preferred embodiments, the species that is selected for study using the methods
illustrated in Fig. 79 can be crossed. In such preferred embodiments, crosses (e.g., F2
intercross) between all pairs of the founding strains are performed. For example, in one
embodiment, six founding strains are used so a total of 15 crosses are performed. In some
embodiments, rather than performing an F2 intercross, other cross designs are used. For
example, in some embodiments, a backcross or F2 random mating scheme is employed.
In preferred embodiments each of the crosses (for example the 15 crosses using the 6
founder strains) is treated as a single large pedigree. In some embodiments, the final
population size that is studied has a size of more than 1,000 organisms, between 100 and
100,000 organisms, less than 500,000 organisms, or, more preferably, between 5,000 and
25,000 organisms. This population is treated as a single large pedigree and genotype
information is collected from this population using a standard set of, for example, more
than 500 markers.
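
The count of 15 crosses follows from taking every unordered pair of six founders, 6 x 5 / 2 = 15. A short Python illustration is given below; the particular six strains are chosen arbitrarily from the strain list above and are an assumption of this example.

```python
from itertools import combinations

# Six founder strains, picked only for illustration.
founders = ["A/J", "C57BL/6J", "BALB/cJ", "C3H/HeJ", "CAST/Ei", "DBA/2J"]

# One cross per unordered pair of distinct founders.
crosses = list(combinations(founders, 2))
number_of_crosses = len(crosses)
```

For n founders the same construction gives n(n-1)/2 crosses, so the number of crosses grows quadratically with the number of founding strains.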

The advantage of the different crosses and large numbers is that it introduces a
significant amount of trait heterogeneity into the population, which allows for more
connections between more pathways relating directly to the diseases of interest, and with
such large numbers, it will be possible to detect first and second order interactions.
Further, with such large numbers of organisms 46 (Fig. 1) over different strains, there will
be enough recombination to solve problems regarding describing genetic correlation
(genetic correlation is a function of linkage disequilibrium and pleiotropy, and in single
small crosses, these components are confounded). Further, as illustrated below, detection
of epistatic interactions and minimization of the effects of linkage disequilibrium on
genetic correlation would allow for the reconstruction of pathways more reliably.

Step 7904. In step 7904, the population under study is phenotyped with respect to
a trait or traits of interest using quantitative trait loci (QTL) analysis in which a
phenotypic statistic set, representing the trait of interest, is used as the quantitative trait in
the QTL analysis, thereby identifying one or more clinical quantitative trait loci (cQTL)
that link to the trait. In processing step 7904, a cQTL that is linked to a trait of interest is
identified using QTL analysis.

In some embodiments, a phenotypic statistic set (plurality of phenotypic values)
for the trait of interest serves as the clinical trait used in the QTL analysis. Fig. 80
illustrates exemplary phenotypic statistic sets. In Fig. 80, each phenotypic statistic set
8000 includes a phenotypic value 8004 for a given phenotype for each organism in a
plurality of organisms under study. As used herein, a phenotypic value is any form of
measurement of a phenotypic trait associated with the trait of interest (e.g., complex
disease). For example, if the trait of interest is obesity, a suitable phenotypic trait could
include cholesterol level in the blood of the organism. In such an example, the
phenotypic value can be milligrams of cholesterol per liter of blood.

In one embodiment, processing step 7904 comprises a classical form of QTL
analysis in which a phenotypic trait is quantified to form a phenotypic statistic set. In
some embodiments, processing step 7904 employs a whole genome search of genetic
markers using the genotypic data from step 7902. For each genotypic position in the
genome of the population that is analyzed, processing step 7904 provides a statistical
measure (e.g., statistical score), such as the maximum lod score between the genomic
position and the phenotypic statistic set. Thus, processing step 7904 yields all the
positions in the genome of the organism of interest that are linked to the phenotypic
statistic set tested. Such embodiments of processing step 7904 were described by Lander and
Botstein in Genetics 121, 174-179 (1989). They are also described in International
Application WO 90/04651, International Application WO 99/13107, Lander and Schork,
Science 265, 2037-2048 (1994), and Doerge, Nature Reviews Genetics 3, 43-62, (2002).
In other embodiments of processing step 7904, association analysis, as described, for
example, in Section 5.14 is used rather than linkage analysis.

In one embodiment of the present invention, the QTL analysis (Fig. 79A, step
7904) comprises: (i) testing for linkage between (a) the genotype of a plurality of
organisms at a position in the genome of a single species and (b) the phenotypic statistic
set (e.g., plurality of phenotypic values), (ii) advancing the position in the genome by an
amount, and (iii) repeating steps (i) and (ii) until all or a portion of the genome has been
tested. In some embodiments, the amount advanced in each instance of (ii) is less than
100 centiMorgans, less than 10 centiMorgans, less than 5 centiMorgans, or less than 2.5
centiMorgans, or between 2.5 centiMorgans and 500 centiMorgans. A Morgan is a unit
that expresses the genetic distance between markers on a chromosome. A Morgan is
defined as the distance on a chromosome in which one recombinational event is expected
to occur per gamete per generation. In some embodiments, the testing comprises
performing linkage analysis (Section 5.13) or association analysis (Section 5.14) that
generates a statistical score for the position in the genome of the single species. In some
embodiments, the testing is linkage analysis and the statistical score is a logarithm of the
odds (lod) score (Section 5.4). Thus, in some embodiments, a cQTL identified in
processing step 7904 is represented by a lod score that is greater than 2.0, greater than
3.0, greater than 4.0, or greater than 5.0.
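
By way of illustration only, the stepping procedure of (i) through (iii) can be sketched as
follows. This is a minimal sketch under stated assumptions: the function names genome_scan
and toy_lod, the single linear coordinate in centiMorgans, and the toy scoring function are
hypothetical and are not part of the specification. Any linkage or association test could be
supplied as the scoring function.

```python
from typing import Callable, List, Tuple


def genome_scan(genome_length_cm: float,
                step_cm: float,
                score_at: Callable[[float], float],
                lod_threshold: float = 3.0) -> List[Tuple[float, float]]:
    """Step along a genome and keep positions whose score exceeds a threshold.

    score_at is a caller-supplied (hypothetical) function returning the
    statistical score, e.g. a lod score, at a position given in cM.
    """
    if step_cm <= 0:
        raise ValueError("step_cm must be positive")
    hits = []
    n_steps = int(genome_length_cm // step_cm)
    for k in range(n_steps + 1):
        position = k * step_cm           # (ii) advance the position by a fixed amount
        score = score_at(position)       # (i) test for linkage at this position
        if score > lod_threshold:
            hits.append((position, score))
    return hits                          # (iii) the loop ends once the region is covered


def toy_lod(position_cm: float) -> float:
    """Illustrative score with a single peak at 40 cM (not real data)."""
    return max(0.0, 5.0 - abs(position_cm - 40.0) / 2.0)


hits = genome_scan(100.0, 2.5, toy_lod, lod_threshold=3.0)
```

In this toy case only the positions near the peak at 40 cM exceed the threshold of 3.0.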


In embodiments where more than one cross is considered in step 7902, a separate
phenotypic statistic set is created for the progeny of each cross. For example, consider
the case where the phenotypic value under consideration is blood cholesterol level.
Further, in this example, there are six founder strains and a total of fifteen crosses. In this
example, fifteen phenotypic statistic sets are constructed for blood cholesterol level, one
for the progeny of each of the fifteen crosses. Then, a separate QTL analysis is performed
with the progeny of each of the fifteen crosses. For each of these crosses, the phenotypic
statistic set associated with the cross is used as the quantitative trait in the QTL analysis.
It will be appreciated that a large number of clinical traits can be considered. For each
such clinical trait, measurements of the organisms 46 are made. Then, phenotypic
statistic sets are created for each clinical trait considered. Further, as described above, in
the case where there are multiple crosses, the phenotypic measurements from the progeny
of each cross are used to form a respective phenotypic statistic set that is associated with
the cross.
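
The bookkeeping in this example can be sketched as follows. The sketch assumes, purely for
illustration, that the fifteen crosses are all pairwise crosses of the six founder strains
(six choose two is fifteen); the specification does not state how the crosses are chosen.
The strain labels, organism identifiers, and cholesterol values are hypothetical.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Tuple

Cross = Tuple[str, str]


def all_pairwise_crosses(founders: List[str]) -> List[Cross]:
    """Every unordered pair of founder strains (an assumption of this sketch)."""
    return list(combinations(founders, 2))


def build_statistic_sets(
    crosses: List[Cross],
    measurements: Iterable[Tuple[Cross, str, float]],
) -> Dict[Cross, Dict[str, float]]:
    """Group phenotypic values by cross: one phenotypic statistic set per cross."""
    sets: Dict[Cross, Dict[str, float]] = {cross: {} for cross in crosses}
    for cross, organism_id, value in measurements:
        sets[cross][organism_id] = value
    return sets


founders = ["S1", "S2", "S3", "S4", "S5", "S6"]   # hypothetical strain labels
crosses = all_pairwise_crosses(founders)
sets = build_statistic_sets(crosses, [
    (("S1", "S2"), "mouse-001", 182.0),           # hypothetical cholesterol values
    (("S1", "S2"), "mouse-002", 175.5),
    (("S3", "S6"), "mouse-003", 201.0),
])
```

Each of the fifteen resulting sets would then serve as the quantitative trait in its own QTL
analysis.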

In some embodiments, the progeny of each cross are subjected to a perturbation
prior to phenotyping. In some embodiments, this perturbation is a drug treatment,
variable diet and/or fasting/refeeding. Then, a phenotypic statistic set is created from the
progeny of the crosses prior to quantitative trait loci (QTL) analysis.

In the case where multiple QTL analyses are performed with the same trait, each
such analysis corresponding to the progeny of a different cross in a plurality of crosses,
there remains the task of combining the results of each such QTL analysis. For example,
in the case where the phenotype is blood cholesterol level and there are fifteen crosses in
the population, fifteen QTL analyses are performed using blood cholesterol as the
quantitative trait, resulting in fifteen lod score curves across the genome of the species
under consideration. In some embodiments, the lod score curves for the QTL overlapping
in each of the crosses are combined in an additive fashion to assess the overall
significance of the QTL over the different crosses. However, this type of method ignores
the relationship between the crosses that exists if they share a common parent. For
example, if two crosses are constructed from three inbred lines of mice (so they
share a common parent), then the progeny of each cross will share a larger percentage of
alleles over the entire genome than would be expected by chance. By taking this
relationship into account over the multiple crosses that are present in some embodiments
of the present invention, a significant increase in the power to detect QTL, detect
interactions between QTL, and detect interactions between QTL and environmental
conditions is achieved.
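
For reference, the simple additive combination of lod score curves mentioned above can be
sketched as follows. This is only the additive scheme, which, as noted, ignores relatedness
between crosses that share a parent. The function name combine_lod_additive and the
assumption that every curve is sampled on the same grid of genome positions are
hypothetical.

```python
from typing import List, Sequence


def combine_lod_additive(curves: Sequence[Sequence[float]]) -> List[float]:
    """Sum lod scores position by position across crosses.

    Each curve is assumed to be sampled on the same grid of genome positions.
    """
    if not curves:
        raise ValueError("at least one lod score curve is required")
    length = len(curves[0])
    if any(len(curve) != length for curve in curves):
        raise ValueError("all curves must cover the same positions")
    return [sum(scores) for scores in zip(*curves)]


# Two hypothetical crosses scored at the same three positions.
combined = combine_lod_additive([[1.0, 2.0, 0.5], [0.5, 1.5, 0.25]])
```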

In one embodiment of the present invention, multiple lod score curves, where each
curve represents a QTL analysis of the progeny of a different cross using a given
quantitative trait, are simultaneously considered. However, rather than simply combining
the lod score curves in an additive fashion, "identical by descent" (IBD) matrices are
calculated. Such matrices assess the probability that any two animals from the different
crosses have inherited a common allele at any given position in the genome. These IBD
matrices are then used to appropriately weight the different distributions in the phenotype
of interest that can arise when the phenotype is linked to a particular region in the
genome. For example, regions that are likely to have inherited a common allele are
downweighted relative to regions that are likely to have inherited from different alleles.

The embodiments that follow in this paragraph apply to instances where the
species under study are mice. Based on this disclosure, those of skill in the art will
realize corresponding phenotypes that can be measured in other species and all such
phenotypes are within the scope of the present invention. In some embodiments, the
disease of interest is diabetes and/or insulin resistance and the phenotypes that are
measured in step 7904 include plasma glucose, plasma insulin, insulin glucose, and a
glucose tolerance test (GTT). In some embodiments, the disease of interest is
atherosclerosis, and the phenotypes that are measured in step 7904 include aortic lesion
and fatty streak (i. levels, ii. paraffin 5 μm section immunohistochemistry for several
markers such as FLAP, 5LO, dendritic cells, T cells, CD11b mono infiltration, BrdU
proliferation, apoptosis, iii. endothelial cells and macrophage function), brain lesion,
vascular calcification, paraoxonase, osteopontin, and PAI-1. In some embodiments, the
disease of interest is obesity, and the phenotypes that are measured in step 7904 include
body weight, anal-nasal length, fat pad weights (e.g., perimetrial fat pad mass, mesenteric
omental fat pad mass, subcutaneous fat pad mass, and retroperitoneal fat pad mass), NMR
fat mass, NMR muscle mass, leptin levels, food intake, liver weight, glucagon,
adiponectin, and IGF-1. In some embodiments, the disease of interest is hypertension,
and the phenotypes that are measured in step 7904 include blood pressure, and response
to angiotensin II. In some embodiments, the disease of interest is asthma and chronic
obstructive pulmonary disease (COPD) and the phenotypes that are measured in step 7904
include airway hyper-responsiveness with and without antigen challenge and airway
hyper-responsiveness in mice exposed to smoke for a significant length of time. In some
embodiments, the trait of interest is plasma lipase activity and the phenotypes that are
measured in step 7904 include lipoprotein lipase (LPL), hepatic lipase (HL), and
endothelial lipase activity. In some embodiments, the trait of interest is plasma lipids and
the phenotypes that are measured in step 7904 include total cholesterol (TC), high-density
lipoprotein cholesterol (HDL), very low density lipoprotein / low density lipoprotein
(VLDL/LDL), triglycerides, fatty acids, ketone bodies, lactate, LDL oxidation, and HDL
protection. In some embodiments, the trait of interest is plasma cytokines and the
phenotypes that are measured in step 7904 include interleukin 6 levels, interleukin 1-beta
levels, tumor necrosis factor alpha/gamma (TNF-alpha/gamma), and interleukin 4 levels.
In some embodiments, the phenotypes that are measured include monocyte isolation from
plasma and ELISA or LC-MS for leukotrienes. In some embodiments, the disease under
study is inflammation and the phenotypes that are measured in step 7904 include
E06/MDA oxLDL ELISA, lipoprotein properties, macrophage/T cell interactions, and
INF-gamma levels. In some embodiments, cardiac related traits are of interest and the
phenotypes that are measured in step 7904 include heart/brain weight ratio, heart rate /
femur length, cardiac fibrosis, and myocardial calcification. In some embodiments, bone
traits are of interest and the phenotypes that are measured in step 7904 include bone
density (scans), femur CT BMD, total femur x-ray BMD, total femur x-ray BMC, femur
CT-determined BMC, femur diaphyseal BMC, femur diaphyseal BMD, intertrochanteric
BMC, intertrochanteric BMD, femur volume by CT, femur x-ray area, femur diaphyseal
cortical thickness, femur width at the diaphysis, right and left femur length, right and left
tibia length, right and left length of forepaw 1st, 2nd, 3rd, 4th, and 5th digits, right and left
humerus length, right and left radius length, right and left ulna length, femur width at the
intertrochanteric region, femur fracture energy, stiffness of femur, and strength of femur.

Step 7906. In step 7906 cellular constituent abundance data 44 (e.g., from a gene
expression study or a proteomics study) is obtained for a plurality of cellular constituents
from one or more tissues in each member of the population under study. In some
embodiments, cellular constituent abundance data 44 comprises the processed microarray
images for each individual (organism) 46 in a population under study. For example, in
one such embodiment, this data comprises, for each individual 46, cellular constituent
abundance information 50 for each cellular constituent 48 represented on the array,
optional background signal information 52, and optional associated annotation
information 54 describing the probe used for the respective cellular constituent 48 (Fig.
1). See, for example, Section 5.8, below.


In various embodiments of the present invention, aspects of the biological state
other than the transcriptional state, such as the translational state, the activity state, or
mixed aspects can be measured and used as cellular constituent abundance data. See, for
example, Section 5.9, below. For instance, in some embodiments, cellular constituent
abundance data 44 is, in fact, protein levels for various proteins in the organisms 46 under
study. Thus, in some embodiments, cellular constituent abundance data comprises
amounts or concentrations of the cellular constituent in tissues of the organisms under
study, cellular constituent activity levels in one or more tissues of the organisms under
study, the state of cellular constituent modification (e.g., phosphorylation), or other
measurements relevant to the trait under study.

In one aspect of the present invention, the expression level of a gene in an
organism in the population of interest is determined by measuring an amount of at least
one cellular constituent that corresponds to the gene in one or more cells of the organism.
In one embodiment, the amount of the at least one cellular constituent that is measured
comprises abundances of at least one RNA species present in one or more cells. Such
abundances can be measured by a method comprising contacting a gene transcript array
with RNA from one or more cells of the organism, or with cDNA derived therefrom. A
gene transcript array comprises a surface with attached nucleic acids or nucleic acid
mimics. The nucleic acids or nucleic acid mimics are capable of hybridizing with the
RNA species or with cDNA derived from the RNA species. In one particular
embodiment, the abundance of the RNA is measured by contacting a gene transcript array
with the RNA from one or more cells of an organism in the plurality of organisms under
study, or with nucleic acid derived from the RNA, such that the gene transcript array
comprises a positionally addressable surface with attached nucleic acids or nucleic acid
mimics, where the nucleic acids or nucleic acid mimics are capable of hybridizing with
the RNA species, or with nucleic acid derived from the RNA species.

In some embodiments, cellular constituent abundance data 44 is taken from tissues
that have been associated with a trait under study. For example, in one nonlimiting
embodiment where the complex trait under study is human obesity, cellular constituent
abundance data 44 is taken from the liver, brain, or adipose tissues. More generally, in
some embodiments of the present invention, cellular constituent abundance data 44 is
measured from multiple tissues of each organism 46 (Fig. 1) under study. For example,
in some embodiments, cellular constituent abundance data 44 is collected from one or
more tissues selected from the group of liver, brain, heart, skeletal muscle, white adipose
from one or more locations, and blood. In such embodiments, the data is stored in a data
structure such as the data structure of Fig. 83. This data structure is described in more detail
below.

In some embodiments, particularly in embodiments where multiple crosses are
simultaneously considered, each progeny mouse (and a number of parental and F1 mice)
is extensively phenotyped by collecting multiple tissues from each such mouse for
expression profiling. For example, tissue samples that can be collected for profiling
include, but are not limited to, brain (possibly different brain parts), liver, white adipose
tissue, skeletal muscle, heart, blood, kidney, lung, intestine, and stomach. In some
embodiments, expression profiling of at least three of these tissues across some number of
animals is performed. This rich set of clinical/biochemical phenotypes and gene
expression traits over many tissues across multiple crosses allows for reconstruction of
pathways involved in any of the clinical traits represented.

In some embodiments, once cellular constituent abundance data has been
assembled, the data is transformed into abundance statistics that are used to treat each
cellular constituent abundance in cellular constituent abundance data 44 as a quantitative
trait. In some embodiments, cellular constituent abundance data 44 (Fig. 1) comprises
gene expression data for a plurality of genes (or cellular constituents that correspond to
the plurality of genes). In one embodiment, the plurality of genes comprises at least five
genes. In another embodiment, the plurality of genes comprises at least one hundred
genes, at least one thousand genes, at least twenty thousand genes, or more than thirty
thousand genes. The expression statistics commonly used as quantitative traits in the
analyses in one embodiment of the present invention include, but are not limited to, the
mean log ratio, log intensity, and background-corrected intensity. In other embodiments,
other types of expression statistics are used as quantitative traits. In such embodiments,
the expression levels of a plurality of genes in each organism under study are normalized.
Any normalization routine can be used. Representative normalization routines include,
but are not limited to, Z-score of intensity, median intensity, log median intensity, Z-score
standard deviation log of intensity, Z-score mean absolute deviation of log intensity,
calibration DNA gene set, user normalization gene set, ratio median intensity correction,
and intensity background correction. Furthermore, combinations of normalization
routines can be used. Exemplary normalization routines in accordance with the present
invention are disclosed in more detail in Section 5.3, below. The expression statistics
formed from the transformation are then stored in abundance / genotype warehouse 76,
where they are ultimately matched with the corresponding genotype information.
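
As a non-limiting illustration of one normalization of the kind listed above, the following
sketch computes a Z-score of log intensity. It is an assumption of this sketch that the
Z-score is taken over the base-10 logarithms of the raw intensities using the sample
standard deviation; the routines of Section 5.3 may be defined differently. The function
name zscore_log_intensity and the example intensities are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence


def zscore_log_intensity(intensities: Sequence[float]) -> List[float]:
    """Z-score of log10 intensity: (log value - mean of logs) / std of logs."""
    if len(intensities) < 2:
        raise ValueError("at least two intensities are required")
    if any(value <= 0 for value in intensities):
        raise ValueError("intensities must be positive to take a logarithm")
    logs = [math.log10(value) for value in intensities]
    center = mean(logs)
    spread = stdev(logs)                 # sample standard deviation
    if spread == 0:
        raise ValueError("intensities are constant; Z-score is undefined")
    return [(value - center) / spread for value in logs]


z = zscore_log_intensity([10.0, 100.0, 1000.0])   # hypothetical raw intensities
```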

Once cellular constituent abundance data has been transformed into corresponding
expression statistics and a genetic marker map has been constructed, the data is
transformed into a structure that associates all marker, genotype and expression data for
input into QTL analysis software. This structure is stored in abundance / genotype
warehouse 76.

Step 7908. Given gene expression data for a specific tissue of interest in a
population that has been genotyped and phenotyped with respect to a disease trait of
interest, the next step is to identify all cellular constituents that are significantly
associated with the disease trait. A variety of methods can be used to establish
associations between cellular constituent abundance and clinical traits, including simple
Pearson correlations, basic discriminant analysis, t-tests, and ANOVA, in order to
identify those cellular constituent abundance values that discriminate the extremes of the
clinical trait, as well as more advanced regression models that specifically assess
relationships between cellular constituent abundance values and clinical traits. In some
embodiments, only the cellular constituents that are differentially expressed in at least ten
percent, at least twenty percent, or at least thirty percent of the organisms profiled are
considered. Then, of these differentially expressed cellular constituents, only those
cellular constituents whose abundance values across the population have a Pearson
correlation with the trait of interest T, as exhibited by the organisms profiled, with a
p-value that is less than 0.00001, 0.0001, 0.001 or 0.01 are considered. The product of step
7908 is a set of cellular constituents (association set D) whose abundance levels across the
population under study significantly associate with the trait of interest.

To illustrate, consider the hypothetical cellular constituent A in a population of
100 organisms. If just one tissue is considered in this population, then there will be 100
abundance values for cellular constituent A, one from each of the 100 organisms.
Likewise, there will be 100 measurements of the trait of interest (e.g., tail length), one for
each of the 100 organisms. In step 7908, then, the question is asked whether the 100
cellular constituent abundance values significantly correlate with the 100 trait
measurement values. As indicated above, a statistical measure, such as the Pearson
correlation coefficient between the abundance values and the trait measurements, can be
used. If a certain threshold correlation value or other metric is achieved, the cellular
constituent is considered significantly associated with the trait.
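
The selection just described can be sketched as follows. This is a minimal sketch that
thresholds on the absolute value of the Pearson correlation coefficient; a p-value threshold,
as recited above, could be used instead but is not computed here. The function names, the
threshold of 0.8, and the small data set are hypothetical.

```python
import math
from typing import Dict, Sequence, Set


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def association_set(abundance: Dict[str, Sequence[float]],
                    trait: Sequence[float],
                    min_abs_r: float = 0.8) -> Set[str]:
    """Constituents whose abundance correlates with the trait above a threshold."""
    return {name for name, values in abundance.items()
            if abs(pearson_r(values, trait)) >= min_abs_r}


trait = [1.0, 2.0, 3.0, 4.0, 5.0]                  # hypothetical trait values
abundance = {
    "A": [2.0, 4.0, 6.0, 8.0, 10.0],               # tracks the trait closely
    "B": [5.0, 1.0, 4.0, 2.0, 3.0],                # weakly related to the trait
}
selected = association_set(abundance, trait)
```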

In some embodiments, multiple crosses are considered simultaneously. For the
purposes of step 7908, the progeny of the multiple crosses can be treated as a single large
population. For example, if there are fifty organisms from a first cross and fifty
organisms from a second cross, the combined total of 100 organisms is treated as a single
population. Alternatively, the progeny of each cross can be considered independently.
Thus, in the example where there are two crosses, each with fifty progeny, an
independent determination can be made of the cellular constituents whose abundance
levels significantly associate with the trait of interest. Then the test sets of cellular
constituents that associate with the trait in the respective crosses can be combined. For
instance, consider the case where cellular constituents A and B significantly associate
with the trait in the progeny of a first cross and cellular constituents B and C significantly
associate with the trait in the progeny of the second cross. In this instance, the sets can be
combined such that step 7908 realizes an association set D comprising cellular
constituents A, B, and C. There are any number of rules that can be devised to combine
the results when crosses are considered separately in step 7908. The case of simple
addition (e.g., A, B, and C) has been presented above. Alternatively, only those cellular
constituents that are significantly associated with the trait in all the crosses (or a majority
of the crosses or some other percentage of the crosses) are placed in association set D.
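
The combination rules described above can be sketched with a single parameter giving the
minimum number of crosses in which a cellular constituent must be significant. The function
name combine_association_sets and the parameter min_crosses are hypothetical. A value of one
gives simple addition (the union), and a value equal to the number of crosses requires
significance in all crosses.

```python
from collections import Counter
from typing import Iterable, List, Set


def combine_association_sets(per_cross_sets: List[Iterable[str]],
                             min_crosses: int = 1) -> Set[str]:
    """Keep constituents that are significant in at least min_crosses crosses."""
    if min_crosses < 1:
        raise ValueError("min_crosses must be at least 1")
    # Count each constituent once per cross in which it appears.
    counts = Counter(name for cross_set in per_cross_sets for name in set(cross_set))
    return {name for name, count in counts.items() if count >= min_crosses}


first_cross = {"A", "B"}
second_cross = {"B", "C"}
union_set = combine_association_sets([first_cross, second_cross], min_crosses=1)
strict_set = combine_association_sets([first_cross, second_cross], min_crosses=2)
```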

Step 7910. In step 7910, a quantitative trait locus (QTL) analysis is performed
using data corresponding to each cellular constituent i in association set D. For 1,000
cellular constituents, this results in 1,000 separate QTL analyses. For embodiments in
which multiple tissue samples are collected for each organism, this results in even more
separate QTL analyses. For example, in embodiments in which samples are collected
from two different tissues, an analysis of 1,000 cellular constituents can require 2,000
separate QTL analyses. In embodiments where multiple crosses are considered, the
crosses are preferably considered in the QTL analysis as a single population. In one
example, each QTL analysis steps through the genome of the organism of interest.
Linkages to the gene under consideration are tested at each step or location along the
length of the genome. In such embodiments, each step or location along the length of the
chromosome is at regularly defined intervals. In some embodiments, these regularly
defined intervals are defined in Morgans or, more typically, centiMorgans (cM). In other
embodiments, each regularly defined interval is less than 10 cM, less than 5 cM, or less
than 2.5 cM.

In each QTL analysis, data corresponding to a cellular constituent selected from
association set D is used as a quantitative trait. More specifically, for any given
cellular constituent i, the quantitative trait used in the QTL analysis is an abundance
statistic set such as set 8104 (Fig. 81). Abundance statistic set 8104 comprises the
corresponding abundance statistic 8108 for the corresponding cellular constituent 8102
from each organism 8106 in the population under study. Fig. 82 illustrates an exemplary
abundance statistic set 8104 in accordance with one embodiment of the present invention
for the case in which abundance data from only one tissue type is considered and cellular
constituent abundance is gene expression. The exemplary abundance statistic set 8104 of
Fig. 82 includes the abundance level 8108 of a gene G (or cellular constituent that
corresponds to gene G) from each organism in a plurality of organisms. For example,
consider the case where there are ten organisms in the plurality of organisms, and each of
the ten organisms expresses gene G. In this case, abundance statistic set 8104 includes
ten entries, each entry corresponding to a different one of the ten organisms in the
plurality of organisms. Further, each entry represents the abundance level (e.g.,
expression level) of gene G in the organism represented by the entry. So, entry "1"
(8108-G-1) (Fig. 82) corresponds to the abundance level of gene G in organism 1, entry
"2" (8108-G-2) (Fig. 82) corresponds to the abundance level of gene G in organism 2,
and so forth.

Referring to Fig. 83, in some embodiments of the present invention, abundance
data from multiple tissue samples of each organism 8106 under study are collected.
When this is the case, the data can be stored in the exemplary data structure illustrated in
Fig. 83. In Fig. 83, a plurality of cellular constituents 8102 are represented. Further,
there is an abundance statistic set 8104 for each cellular constituent 8102. Each
abundance statistic set 8104 represents an abundance of the corresponding cellular
constituent in each of a plurality of organisms.
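
One possible in-memory arrangement of such data is sketched below. It is not the data
structure of Fig. 83, whose layout is not reproduced here; the class names
AbundanceStatisticSet and AbundanceStore, the keying by constituent and tissue, and the
example values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class AbundanceStatisticSet:
    """Abundance statistics for one cellular constituent in one tissue."""
    constituent: str
    tissue: str
    by_organism: Dict[str, float] = field(default_factory=dict)


class AbundanceStore:
    """Holds one abundance statistic set per (constituent, tissue) pair."""

    def __init__(self) -> None:
        self._sets: Dict[Tuple[str, str], AbundanceStatisticSet] = {}

    def add(self, constituent: str, tissue: str, organism: str, value: float) -> None:
        key = (constituent, tissue)
        if key not in self._sets:
            self._sets[key] = AbundanceStatisticSet(constituent, tissue)
        self._sets[key].by_organism[organism] = value

    def statistic_set(self, constituent: str, tissue: str) -> AbundanceStatisticSet:
        return self._sets[(constituent, tissue)]


store = AbundanceStore()
store.add("gene G", "liver", "organism 1", 2.5)    # hypothetical abundance levels
store.add("gene G", "liver", "organism 2", 1.75)
store.add("gene G", "brain", "organism 1", 0.5)
```

Keying by both constituent and tissue reflects the observation above that each additional
tissue gives rise to a further quantitative trait, and hence a further QTL analysis, per
cellular constituent.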

In one embodiment of the present invention, each QTL analysis (Fig. 79A, step
7910) comprises: (i) testing for linkage between a position in a genome and an abundance
statistic set 8104, (ii) advancing the position in the genome by an amount (e.g., less than
100 cM, less than 3 cM), and (iii) repeating steps (i) and (ii) until the entire genome is
tested. In some embodiments, testing for linkage between a given position in the genome
and the abundance statistic set comprises correlating differences in the abundance found
in the abundance level statistic with differences in the genotype at the given position
using single marker tests (for example using t-tests, analysis of variance, or simple linear
regression statistics). See, e.g., Statistical Methods, Snedecor and Cochran, 1985, Iowa
State University Press, Ames, Iowa. However, there are many other methods for testing
for linkage between an abundance statistic set and a given position in the chromosome. In
particular, if the abundance statistic set is treated as the phenotype (in this case, a quantitative
phenotype), then methods such as those disclosed in Doerge, 2002, Mapping and analysis
of quantitative trait loci in experimental populations, Nature Reviews: Genetics 3:43-62,
may be used. Concerning steps (i) through (iii) above, if the genetic length of a given
genome is N cM and 1 cM steps are used, then N different tests for linkage are
performed.
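
A single marker test of the kind mentioned above can be sketched as follows. The sketch
compares a model with one mean per genotype class against a model with a single overall
mean and expresses the result as a lod score using the standard normal-model relation
lod = (n/2) log10(RSS0/RSS1). This particular formulation is an assumption of the sketch
and is not asserted to be the computation used in any embodiment. The function name
single_marker_lod, the genotype labels, and the abundance values are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def single_marker_lod(genotypes: Sequence[str], values: Sequence[float]) -> float:
    """lod score from comparing per-genotype means with a single overall mean."""
    n = len(values)
    if n == 0 or n != len(genotypes):
        raise ValueError("genotypes and values must be non-empty and equal in length")
    grand_mean = sum(values) / n
    rss_null = sum((v - grand_mean) ** 2 for v in values)   # no genotype effect
    groups: Dict[str, List[float]] = defaultdict(list)
    for genotype, value in zip(genotypes, values):
        groups[genotype].append(value)
    rss_alt = 0.0                                           # one mean per genotype
    for group in groups.values():
        group_mean = sum(group) / len(group)
        rss_alt += sum((v - group_mean) ** 2 for v in group)
    if rss_null == 0:
        return 0.0            # no variation at all, so no evidence of linkage
    if rss_alt == 0:
        return math.inf       # genotype explains the trait perfectly
    return (n / 2) * math.log10(rss_null / rss_alt)


values = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]                     # hypothetical abundances
linked = single_marker_lod(["AA", "AA", "AB", "AB", "BB", "BB"], values)
unlinked = single_marker_lod(["AA", "AB", "BB", "AA", "AB", "BB"], values)
```

In this toy data the first marker separates the abundance values by genotype and yields a
large lod score, whereas the second marker does not.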

In some embodiments, the QTL data produced from each respective QTL analysis
comprises a logarithm of the odds score (lod) computed at each position tested in the
genome under study. A lod score is a statistical estimate of whether two loci are likely to
lie near each other on a chromosome and are therefore likely to be genetically linked. In
the present case, a lod score is a statistical estimate of whether a given position in the
genome under study is linked to the quantitative trait corresponding to a given gene. Lod
scores are further defined in Section 5.4, below. In some embodiments, a lod score of 2.0
or more is generally taken to indicate that two loci are genetically linked. In some
embodiments, a lod score of 3.0 or more is generally taken to indicate that two loci are
genetically linked. In some embodiments, a lod score of 4.0 or more is generally taken to
indicate that two loci are genetically linked. The generation of lod scores requires
pedigree data. Accordingly, in embodiments in which a lod score is generated,
processing step 7910 is essentially a linkage analysis, as described in Section 5.13, with
the exception that the quantitative trait under study is derived from data, such as cellular
constituent expression statistics, rather than classical phenotypes such as eye color.

In situations where pedigree data is not available, genotype data from each of the
organisms 46 (Fig. 1) can be compared to each abundance statistic set using allelic
association analysis, as described in Section 5.14, in order to identify QTL that are linked
to each expression statistic set. In one form of association analysis, an affected
population is compared to a control population. In particular, haplotype or allelic
frequencies in the affected population are compared to haplotype or allelic frequencies in
a control population in order to determine whether particular haplotypes or alleles occur
at significantly higher frequency amongst affected compared with control samples.
Statistical tests such as a chi-square test are used to determine whether there are
differences in allele or genotype distributions.
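
As an illustration of such a test, the following sketch computes the Pearson chi-square
statistic (without continuity correction) for a two-by-two table of allele counts in
affected and control samples. The function name chi_square_2x2 and the counts are
hypothetical; for one degree of freedom, a statistic above approximately 3.84 corresponds to
significance at the 0.05 level.

```python
def chi_square_2x2(affected_a1: int, affected_a2: int,
                   control_a1: int, control_a2: int) -> float:
    """Pearson chi-square statistic for a 2x2 table of allele counts.

    Rows are affected and control samples; columns are allele 1 and allele 2.
    """
    a, b, c, d = affected_a1, affected_a2, control_a1, control_a2
    n = a + b + c + d
    row_affected, row_control = a + b, c + d
    col_a1, col_a2 = a + c, b + d
    if 0 in (row_affected, row_control, col_a1, col_a2):
        raise ValueError("every row and column must have a nonzero total")
    return n * (a * d - b * c) ** 2 / (row_affected * row_control * col_a1 * col_a2)


# Hypothetical counts: allele 1 is seen 60 times in affected, 40 times in controls.
statistic = chi_square_2x2(60, 40, 40, 60)
```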

Regardless of whether linkage analysis or association analysis is used in step
7910, the results of each QTL analysis can be stored in a QTL results database (Fig. 84).
For each abundance statistic set (Fig. 81), the QTL results database comprises all
positions in the genome of the organism that were tested for linkage to the quantitative
trait. For each position 8404, genotype data 68 provides the genotype at position 8404 for
each organism in the plurality of organisms under study. For each such position 8404
analyzed by quantitative genetic analysis in step 7910, a statistical measure (e.g.,
statistical score 8406), such as the maximum lod score between the position and the
abundance statistic, is listed. Thus, the data structure comprises all the positions in the
genome of the organism of interest that are genetically linked to each abundance statistic
tested.

Step 7912. In step 7912, those cellular constituents in association set D that do
not have at least one eQTL coincident with at least one cQTL from step 7904 form a
candidate reactive cellular constituent set (Fig. 74, 7906). All cellular constituents in
association set D that have at least one eQTL coincident with at least one cQTL from step
7904 form a candidate causal cellular constituent set (Fig. 74, 7904). In some
embodiments, an eQTL is coincident with a cQTL when the eQTL and the cQTL
colocalize within 40 cM of each other, within 30 cM of each other, within 20 cM of each
other, within 10 cM of each other, within 3 cM of each other, or within 1 cM of each
other in the genome of the species under consideration.

As an example of step 7912, consider the case in which the phenotypic statistic set
is omental fat pad mass in a mouse population and that a QTL analysis in accordance with
step 7904 yields 5 cQTL with LOD scores over 2.0 located on chromosomes 1 at 111 cM,
5 at 90 cM, 6 at 43 cM, 9 at 8 cM, and 19 at 28 cM. All cellular constituents in association
set D that form eQTL at any of these chromosomal locations will be placed in the causal
candidate cellular constituent set (Fig. 74, 7904). All cellular constituents in association
set D that do not form eQTL at any of these chromosomal locations will be placed in the
reactive candidate cellular constituent set (Fig. 74, 7906).
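
The partition in this example can be sketched as follows, taking the five cQTL to lie on
chromosomes 1, 5, 6, 9, and 19 at 111, 90, 43, 8, and 28 cM, respectively. The function
names, the 10 cM coincidence window, and the eQTL positions assigned to the constituents
g1, g2, and g3 are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Locus = Tuple[int, float]   # (chromosome, position in cM)

CQTL: List[Locus] = [(1, 111.0), (5, 90.0), (6, 43.0), (9, 8.0), (19, 28.0)]


def is_coincident(eqtl: Locus, cqtl_list: List[Locus], window_cm: float) -> bool:
    """True if the eQTL lies on the same chromosome within window_cm of a cQTL."""
    chromosome, position = eqtl
    return any(chromosome == c and abs(position - p) <= window_cm
               for c, p in cqtl_list)


def partition_constituents(eqtl_by_constituent: Dict[str, List[Locus]],
                           cqtl_list: List[Locus],
                           window_cm: float = 10.0) -> Tuple[Set[str], Set[str]]:
    """Split constituents into candidate causal and candidate reactive sets."""
    causal: Set[str] = set()
    reactive: Set[str] = set()
    for name, eqtls in eqtl_by_constituent.items():
        if any(is_coincident(eqtl, cqtl_list, window_cm) for eqtl in eqtls):
            causal.add(name)
        else:
            reactive.add(name)
    return causal, reactive


eqtl_by_constituent = {
    "g1": [(1, 108.0)],                 # 3 cM from the cQTL on chromosome 1
    "g2": [(2, 50.0)],                  # no cQTL on chromosome 2
    "g3": [(6, 60.0), (19, 30.0)],      # second eQTL is 2 cM from the cQTL on 19
}
causal, reactive = partition_constituents(eqtl_by_constituent, CQTL)
```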

Each cellular constituent in the candidate causal cellular constituent set gives rise
to at least one eQTL that overlaps with at least one cQTL from step 7904 (an eQTL/cQTL
overlap). There are generally two reasons that two or more traits (here an eQTL and a
cQTL) can be genetically correlated: 1) gametic phase disequilibrium (also known as
linkage disequilibrium) and 2) a single gene affecting multiple traits (pleiotropy). In
some embodiments of the present invention, in order for an eQTL and a cQTL to be
coincident, the QTL associated with the position of the eQTL and cQTL must truly be
common to the clinical and expression trait (due to a pleiotropic effect of a common
QTL) rather than simply represent two closely linked QTL (due to linkage disequilibrium
between two distinct QTL). In such embodiments, a test is implemented to test the
positions between the eQTL and the cQTL to determine whether the positions are
statistically indistinguishable.






In considering a test for pleiotropy in accordance with the present invention, let
$T_1$ and $T_2$ represent quantitative trait random variables, with QTL $Q_1$ and $Q_2$ at
positions $p_1$ and $p_2$, respectively. It is of interest to determine whether $p_1 = p_2$,
indicating a pleiotropic effect at the QTL for traits $T_1$ and $T_2$. Jiang and Zeng, 1995,
Genetics 140, 1111, devised statistical tests to assess whether the positions are equal. A
generalization of this test is implemented in some embodiments of step 7912. Since the
positions under consideration usually will be relatively close together on a given
chromosome (e.g., within 20 cM), it is expected that $T_1$ and $T_2$ will be correlated, and so
the most basic model for these traits under the control of a single, common QTL is formed as:

$$
\begin{pmatrix} T_1 \\ T_2 \end{pmatrix} =
\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} +
\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} Q +
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix}
$$

where $Q$ is a categorical random variable indicating the genotypes at the position of
interest, and $(\varepsilon_1, \varepsilon_2)^{\top}$ is distributed as a bivariate normal random
variable with mean $(0, 0)^{\top}$ and covariance matrix

$$
\begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}
$$

The case where $p_1 = p_2$ represents the null hypothesis of pleiotropy. The aim is
to test this null hypothesis against a more general alternative hypothesis that indicates
$p_1 \neq p_2$. The alternative hypotheses of interest can be captured by the following model:

$$
\begin{pmatrix} T_1 \\ T_2 \end{pmatrix} =
\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} +
\begin{pmatrix} \beta_{11} & \beta_{12} \\ \beta_{21} & \beta_{22} \end{pmatrix}
\begin{pmatrix} Q_1 \\ Q_2 \end{pmatrix} +
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix}
$$

where $\beta_{ij}$ denotes the effect of $Q_j$ on trait $T_i$ and the $\varepsilon_i$ are distributed as
for the pleiotropy model. The null hypothesis can be
compared against any of a series of alternative hypotheses. The likelihoods for the two
competing models (null hypothesis and alternative hypothesis) are easily formed, and
maximum likelihood methods are then employed to estimate the model parameters
($\mu_i$, $\beta_{ij}$, and $\sigma_{ij}$). With the maximum likelihood estimates in hand, the likelihood ratio
test statistic can be formed to directly test the null hypothesis against the alternative.

There are several alternative hypotheses that can be tested in this setting
including:

$$\beta_{11} \neq 0,\ \beta_{12} = 0,\ \beta_{21} = 0,\ \beta_{22} \neq 0$$

indicating closely linked QTL with no pleiotropic effects,

$$\beta_{11} \neq 0,\ \beta_{12} = 0,\ \beta_{21} \neq 0,\ \beta_{22} \neq 0$$

indicating closely linked QTL with pleiotropic effects at the first position,

$$\beta_{11} \neq 0,\ \beta_{12} \neq 0,\ \beta_{21} = 0,\ \beta_{22} \neq 0$$

indicating closely linked QTL with pleiotropic effects at the second position, and

$$\beta_{11} \neq 0,\ \beta_{12} \neq 0,\ \beta_{21} \neq 0,\ \beta_{22} \neq 0$$

indicating closely linked QTL with pleiotropic effects at both positions. Other null
hypotheses and corresponding alternative hypotheses naturally follow from the general
models presented here.

Thus, in embodiments where a pleiotropy test is applied, each cellular constituent
in the candidate causative cellular constituent set has at least one eQTL that is
coincident with a respective cQTL for the trait of interest, where the at least one eQTL
passes a test for pleiotropy with the respective cQTL.

Step 7916. In step 7916, the cellular constituents in the candidate causative
cellular constituent set are rank ordered based upon the amount of genetic variation in
the trait of interest that is explained by the eQTL of the cellular constituent that are
coincident with cQTL from the trait of interest. More specifically, for each cellular
constituent i in the candidate causative cellular constituent set, a determination is made as
to the amount of genetic variation in the trait of interest that is explained by the eQTL of
the respective cellular constituent i coincident with the cQTL from the trait of interest.
Then, the cellular constituents in the candidate causative cellular constituent set are rank
ordered based upon the amount of genetic variation in the trait of interest that is explained
by each cellular constituent determined in this manner.

To illustrate, consider the case in which the trait of interest produces five cQTL.
Further, a cellular constituent i in the candidate causative cellular constituent set has five
eQTL. Four of the eQTL overlap with four of the cQTL for the trait of interest.
However, only three of the eQTL pass the test for pleiotropy. In this example, only the
three eQTL that are coincident with respective cQTL for the trait of interest and that pass
the test for pleiotropy described in step 7912, above, are used to determine how well they
explain the genetic variation in the trait of interest. Thus, in the example, if the first of
the three qualifying eQTL explains ten percent of the genetic variation in the trait of
interest, the second of the three qualifying eQTL explains twenty percent of such genetic
variation, and the third eQTL explains thirty percent of such genetic variation, the three
eQTL, together, explain sixty percent of the genetic variation in the trait of interest.
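The arithmetic of this illustration can be sketched in a few lines of Python. This is a minimal sketch only: the record keys (`coincident`, `pleiotropic`, `explained`), the function names, and the values given to the two non-qualifying eQTL are hypothetical and are not taken from the specification, and the simple summation mirrors the illustrative example rather than the joint analysis used in some embodiments.

```python
def total_explained_variation(eqtl_records):
    """Sum the fraction of trait variation explained by qualifying eQTL.

    Each record is a dict with hypothetical keys:
      'coincident'  - the eQTL overlaps a cQTL for the trait of interest
      'pleiotropic' - the eQTL passed the pleiotropy test
      'explained'   - fraction of genetic variation in the trait explained
    """
    return sum(
        r["explained"]
        for r in eqtl_records
        if r["coincident"] and r["pleiotropic"]
    )


def rank_constituents(constituents):
    """Return constituent names ordered from most to least variation explained."""
    scores = {name: total_explained_variation(recs)
              for name, recs in constituents.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Five eQTL: four overlap a cQTL, and three of those pass the pleiotropy test.
# The 0.15 and 0.05 entries are made-up values for the non-qualifying eQTL.
example = [
    {"coincident": True, "pleiotropic": True, "explained": 0.10},
    {"coincident": True, "pleiotropic": True, "explained": 0.20},
    {"coincident": True, "pleiotropic": True, "explained": 0.30},
    {"coincident": True, "pleiotropic": False, "explained": 0.15},
    {"coincident": False, "pleiotropic": False, "explained": 0.05},
]
total = total_explained_variation(example)
```

Only the three qualifying eQTL contribute to `total`, so the non-qualifying entries have no influence on the rank order.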

In some embodiments, the determination as to how much the qualifying eQTL of
a given cellular constituent explain the genetic variation in the trait of interest is
performed using a joint analysis of the trait of interest at each of the qualifying coincident
eQTL. This joint analysis leads to a lod score as described by Jiang and Zeng, 1995,
Genetics 140, p. 1111 and applied by Schadt et al., 2003, Nature 422, p. 297, to gene
expression traits. Then, the cellular constituents can be rank ordered based on their lod scores.

Step 7918. Step 7918 tests the cellular constituents in the candidate causative
cellular constituent set in a manner that is independent of the pleiotropy test of step 7916.
Step 7918 applies a causality test that, in one embodiment, serves to determine whether
the genetic variation in each eQTL of a given cellular constituent that is coincident with a
cQTL of a trait of interest is correlated with the variation in the trait of interest
conditional on an abundance pattern of the cellular constituent i in the plurality of
organisms. Specific tests can be developed to identify the true relationship between QTL (Q),
cellular constituent abundance (G) and disease trait (T) from the set of possible
relationships depicted in Fig. 79A. However, to maximize the information that can be
derived from the genetics and expression data, the causality test used in step 7918 is best
considered in the context of scenario 7910 of Fig. 79A. Scenario 7910 represents an
optimal situation where a cellular constituent (e.g., gene) is under the control of multiple
disease QTL and still causative for the disease, thereby providing maximal causal
information relating to the disease under study.

The aim of the causality test is to distinguish the relationships that
indicate a cellular constituent is causal for the clinical trait (scenarios 7902, 7908, and
7910 of Fig. 79A) from those that are reactive to, or independent of, the disease trait
(scenarios 7904 and 7906, respectively, of Fig. 79A). The test for causality involving
QTL, cellular constituent abundance (e.g., gene expression) and disease trait data is based
on the same conditional probabilities that underlie mutual information measures that form
the basis of the more general Bayesian network reconstruction problems. See, for
example, Pearl, 1988, Probabilistic Reasoning in Intelligent Systems: Networks of
Plausible Inference, Morgan Kaufmann Publishers, Inc., San Francisco. The causality test
assesses whether the QTL (Q) and the disease trait (T) are correlated conditional on the
cellular constituent abundance trait (G).

Genetic linkages for disease and cellular constituent abundance traits give rise to
information on causality, thereby restricting the number of relationships to consider since
they establish sub-relationships with absolute certainty (e.g., it is known that Q causes
variations in G and T). This restriction allows for a robust statistical test to determine
whether scenarios 7902 and 7910 of Fig. 79A hold over the relationships given by
scenarios 7904 and 7906. Since the test begins with data that indicate G and T are
partially under the control of a common QTL Q, the problem is significantly simplified
over that of the classic network reconstruction problem, where positioning G with respect
to T would require additional traits related to G and T. If one started with no a priori
information on causality between the traits, the exact relationship could not be
unambiguously identified without additional experimentation. See, for example, Pearl,
1988, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference,
Morgan Kaufmann Publishers, Inc., San Francisco.

If it is assumed that traits T and G are jointly distributed as a bivariate normal
random variable with a common QTL between them, then a determination can be made as
to whether the following relationship holds:

$$P(T, Q \mid G) = P(T \mid G)\,P(Q \mid G),$$

where the P's represent probability density functions and, by definition,

$$P(T, Q \mid G) = \frac{P(T, Q, G)}{P(G)},$$

$$P(T \mid G) = \frac{P(T, G)}{P(G)} = \frac{P(G \mid T)\,P(T)}{P(G)},$$

and

$$P(Q \mid G) = \frac{P(Q, G)}{P(G)}.$$

Here, $P(T, Q \mid G)$ is read, "the probability of T and Q given G". The relationship
$P(T, Q \mid G) = P(T \mid G)\,P(Q \mid G)$ indicates that even though T and Q can be significantly
correlated (this holds by definition for a QTL), conditioning on relative abundances G
leads to functional independence between Q and T, as was noted in the example for
Figure 79C. If this relationship holds, then it can be concluded that the information
passed from Q to disease trait T is via G, which supports G as being causal for T. See, for
example, Pearl, 1988, Probabilistic Reasoning in Intelligent Systems: Networks of
Plausible Inference, Morgan Kaufmann Publishers, Inc., San Francisco, Section 3.1.2. If,
conditional on G, Q and T are not independent (e.g., $P(T, Q \mid G) \neq P(T \mid G)\,P(Q \mid G)$), then
one of the relationships given in scenarios 7904 and 7906 more likely holds (the
relationships in these Figures can be tested in a like manner). Conditional independence
is tested by first forming the likelihood functions based on the conditional probabilities
discussed above, for the two competing hypotheses: 1) the null hypothesis that T and Q
are independent given G (G is causal for T), and 2) the alternative hypothesis that T and Q
are dependent given G (G is not causal for T). The likelihood functions can then be
maximized with respect to the parameters of the underlying genetic model, and the
likelihood ratio test statistic formed, which in the present case, under the null hypothesis,
would be chi-square distributed with two degrees of freedom. For more information on
the likelihood functions and likelihood ratio statistics used, see Section 5.19.4, below.

In one embodiment, the correlation between T and Q is considered in terms of a
LOD score. Significant correlation between T and Q is consistent with a significant LOD
score for T at position Q. After conditioning on the gene expression trait G, the causality
test determines whether there is still a significant LOD score for T at Q. If the LOD score
for the QTL drops to zero after conditioning on G, this indicates G effectively blocks
transmission of the information from the QTL to the trait, indicating that scenario 7902
(Fig. 79A) is the more likely explanation of the relationship between T and G (or one of
the variants given in scenarios 7908 or 7910 of Fig. 79A). While this form of the null
hypothesis given above has interesting statistical issues to consider, given causality is
assumed under the null hypothesis, it is consistent with the traditional null hypothesis of
linkage analysis that a given trait is not linked to a particular locus under consideration.

Those cellular constituents in the candidate causative cellular constituent set in
which the null hypothesis of causality is accepted for all of their associated eQTL
overlapping with (coincident with) cQTL represent the strongest set of causal candidates
for the trait of interest.
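The decision rule described above can be sketched as follows, assuming the two maximized log-likelihoods have already been computed elsewhere. The function names and the significance level `alpha` are hypothetical illustrations, not part of the specification. For two degrees of freedom the chi-square upper-tail probability has the closed form exp(-x/2), which is what the sketch relies on.

```python
import math


def likelihood_ratio_statistic(loglik_null, loglik_alt):
    """-2 * log(L_null / L_alt), computed from maximized log-likelihoods."""
    return 2.0 * (loglik_alt - loglik_null)


def chi2_sf_2df(x):
    """Upper-tail probability of a chi-square variable with two degrees of freedom."""
    return math.exp(-x / 2.0) if x > 0 else 1.0


def lod_from_lr(lr_stat):
    """Convert a likelihood ratio statistic to a LOD (base-10 log odds) score."""
    return lr_stat / (2.0 * math.log(10.0))


def causality_null_accepted(loglik_null, loglik_alt, alpha=0.05):
    """Return (accepted, p_value) for the null that T and Q are independent given G.

    alpha is a hypothetical significance level chosen for illustration.
    """
    p_value = chi2_sf_2df(likelihood_ratio_statistic(loglik_null, loglik_alt))
    return p_value >= alpha, p_value
```

A small statistic (a LOD score near zero after conditioning on G) leaves the null hypothesis of causality in place, while a large statistic rejects it.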

In another embodiment, models 7902 (causative), 7904 (reactive), and 7906
(independent) of Fig. 79A are compared directly using a maximum likelihood approach.
In this approach, for each model (independent, causative and reactive), the following
likelihoods are formed based on the relationships depicted in the model:

model 7902 (causative)      $P(Q, G, T) = P(G \mid Q)\,P(T \mid G)$
model 7904 (reactive)       $P(Q, G, T) = P(T \mid Q)\,P(G \mid T)$
model 7906 (independent)    $P(Q, G, T) = P(T \mid Q)\,P(G \mid Q)$

where, as in Fig. 79A, Q is the DNA locus controlling cellular constituent levels and/or
clinical traits, G is cellular constituent level, and T is clinical trait. The likelihoods are
then maximized with respect to the model parameters, given the genotypic data, cellular
constituent abundance data 44, and phenotype data 72 (Fig. 1) for the trait (or traits) of
interest. These maximum likelihood values are then compared using standard techniques,
where the model giving rise to the largest likelihood is declared the best model.
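The comparison of the three factorizations can be illustrated with a small simulation. The sketch below is not the specification's implementation: it assumes an additive 0/1/2 genotype coding, models every conditional density as a simple linear Gaussian regression, and simulates data under the causative ordering Q to G to T with made-up effect sizes. All function names are hypothetical.

```python
import math
import random


def gaussian_regression_loglik(x, y):
    """Maximized log-likelihood of y ~ Normal(a + b*x, s^2) fitted by least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / n  # maximum likelihood estimate of the residual variance
    return -0.5 * n * (math.log(2.0 * math.pi * sigma2) + 1.0)


def compare_models(q, g, t):
    """Maximized log-likelihoods for the causative, reactive and independent models."""
    return {
        "causative": gaussian_regression_loglik(q, g) + gaussian_regression_loglik(g, t),
        "reactive": gaussian_regression_loglik(q, t) + gaussian_regression_loglik(t, g),
        "independent": gaussian_regression_loglik(q, t) + gaussian_regression_loglik(q, g),
    }


def simulate_causative(n, seed=0):
    """Simulate Q -> G -> T with hypothetical effect sizes and unit-variance noise."""
    rng = random.Random(seed)
    q = [rng.choice([0, 1, 1, 2]) for _ in range(n)]  # F2-like 1:2:1 genotype ratio
    g = [2.0 * qi + rng.gauss(0.0, 1.0) for qi in q]
    t = [gi + rng.gauss(0.0, 1.0) for gi in g]
    return q, g, t


q, g, t = simulate_causative(1000)
logliks = compare_models(q, g, t)
best = max(logliks, key=logliks.get)
```

In this sketch all three models have the same number of fitted parameters, so ranking them by maximized likelihood and ranking them by AIC give the same order. Without the genotype Q, the causative and reactive factorizations of two Gaussian traits would be indistinguishable; it is the anchoring to Q that separates them.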

To illustrate, consider a particular trait T, say X, in which 3.3
percent of the trait's variation is explained by a single QTL. Let Y be another trait such
that X is partially causal for Y and the QTL that explains 3.3 percent of X's variation only
explains 1.1% of Y's variation in a given population. Further, the coefficient of
determination between X and Y is only 0.1 (so ten percent of Y's variation is explained
by the variation in X). Clearly, if X and Y were expression or clinical traits, the degree of
association between X and Y here would not be striking and, in fact, would most likely be
missed using conventional techniques such as agglomerative hierarchical clustering of the
data.

Table 1 below gives the Akaike Information Criterion (AIC) for three models in
this case (the AIC value is defined as -2 times the loglikelihood added to two times the
number of parameters in the model). The AIC is used to select the "best" model from a
list of theoretical functions. See, for example, Akaike Information Criterion Statistics,
Mathematics and Its Applications, Japanese Series, Sakamoto et al., D. Reidel Pub. Co.,
January 1987. The model with the smallest AIC value represents the model that best fits
the data and therefore has the highest likelihood given the data.

Table 1

LOD scores (X/Y)   AIC for model 306 (independent)   AIC for model 302 (causal)   AIC for model 304 (reactive)
7.3/2.4            13354.5                           13254.3                      13276.8

From Table 1, it can be seen that causality model 7902 provides the best fit to the data, as
would be expected given the hypothetical data. Next, a determination is made as to
whether the difference in AIC values is statistically significant. Differences between AIC
values essentially represent a likelihood ratio test statistic with one degree of freedom (in
this case). These statistics are chi-square distributed when the models are nested, so if
this were the case here, then the p-value associated with the difference in AIC values
between the causal and reactive model would be 0.000002 (indicating statistical
significance). However, the models in the hypothetical case are not nested, and so the
standard likelihood ratio test theory does not strictly apply but can be used as an
approximate test to determine whether the AIC values are statistically significant.
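The quoted p-value can be checked directly from the Table 1 entries. The sketch below uses only the AIC definition given above and the closed form of the chi-square upper tail for one degree of freedom, erfc(sqrt(x/2)); the function names are hypothetical.

```python
import math


def aic(loglik, n_params):
    """Akaike Information Criterion: -2 * log-likelihood + 2 * number of parameters."""
    return -2.0 * loglik + 2.0 * n_params


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) if x > 0 else 1.0


# AIC values reported in Table 1 for the causal and reactive models.
aic_causal = 13254.3
aic_reactive = 13276.8
delta = aic_reactive - aic_causal   # difference of about 22.5
p_value = chi2_sf_1df(delta)        # on the order of 2e-6
```

As the text notes, this treats the AIC difference as an approximate likelihood ratio statistic, because the causal and reactive models are not nested.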
Permutation testing can also be used to assess the significance of the AIC
differences. If the trait values are permuted in a way that maintains the correlation
between them, but randomizes them with respect to the genotypes, an assessment can be
made as to whether the differences obtained from the permuted data are as big as those
observed from the actual data. In this present example, 1000 permutations were tested and
in no case was the difference between the causal and reactive models as large as it is in
Table 1. This example demonstrates the power of the new causality test. It is effectively
able to identify a strong causal relationship between two traits that were only moderately
associated and weakly linked to a common QTL.
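A permutation scheme of this kind can be sketched generically. In the sketch below the trait values are kept together as (x, y) pairs and the pairs are shuffled relative to the genotypes, which preserves the correlation between the traits while breaking their link to the locus. The function names, the add-one correction in the returned estimate, and the stand-in statistic `mean_difference` are assumptions made for illustration; in the setting described above the statistic would be the AIC difference between the causal and reactive models.

```python
import random


def permutation_p_value(genotypes, trait_pairs, statistic, n_permutations=1000, seed=0):
    """Estimate how often permuted data give a statistic at least as large as observed.

    trait_pairs holds (x, y) tuples; each pair is moved as a unit, so the
    correlation between the two traits is preserved while their association
    with the genotypes is randomized.
    """
    rng = random.Random(seed)
    observed = statistic(genotypes, trait_pairs)
    shuffled = list(trait_pairs)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if statistic(genotypes, shuffled) >= observed:
            hits += 1
    # Add-one correction (an assumption of this sketch) keeps the estimate above zero.
    return (hits + 1) / (n_permutations + 1)


def mean_difference(genotypes, trait_pairs):
    """Hypothetical stand-in statistic: gap in mean x between two genotype classes."""
    a = [x for gt, (x, _) in zip(genotypes, trait_pairs) if gt == 0]
    b = [x for gt, (x, _) in zip(genotypes, trait_pairs) if gt == 1]
    return abs(sum(a) / len(a) - sum(b) / len(b))


genotypes = [0] * 10 + [1] * 10
pairs = [(gt + 0.01 * i, 2.0 * gt + 0.01 * i) for i, gt in enumerate(genotypes)]
p = permutation_p_value(genotypes, pairs, mean_difference, n_permutations=200)
```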

To further highlight the utility that consideration of genotypic information (Fig. 1)
brings in resolving this causal relationship between these moderately associated traits, the
genotypes were randomized at the locus to which the two traits link. This effectively
destroys the genetic association between the traits and the locus. The resulting AIC
values for each of the models are given in Table 2:



Table 2

LOD scores (X/Y)   AIC for model 306 (independent)   AIC for model 302 (causal)   AIC for model 304 (reactive)
7.3/2.4            13397.9                           13287.0                      13287.5

Interestingly, the causal and reactive models were significantly better than the
independent model, indicating the models were still able to capture the correlation
structure between the traits (so randomizing the genotypes does not affect the correlation
structure between the two traits), but the AIC values for the causal and reactive models
are now statistically equivalent. That is, the causality between these associated traits can
no longer be established because the genotypic information was destroyed.

To demonstrate how this procedure can also be used to discriminate traits
related in a causal/reactive way from those related in an independent way (i.e., linked to
the same QTL but otherwise independent), a data set for traits Q and Z, where both traits
are strongly linked to the same QTL, but are otherwise independent, was tested using the
inventive procedure. The results of the analysis are given in Table 3. Here, despite traits
Q and Z being very strongly linked to the same locus, with trait Q significantly more
strongly linked to the locus, the independent model fits the data much better than the
other two alternatives:

Table 3

LOD scores (Q/Z)   AIC for model 306 (independent)   AIC for model 302 (causal)   AIC for model 304 (reactive)
37.8/21.5          9202.8                            9288.5                       9361.1



Step 7920. In optional step 7920, a determination is made as to whether the
cellular constituents in the candidate causative cellular constituent set are druggable.
Hopkins and Groom, 2002, Nature Reviews Drug Discovery 1, p. 727 provides one definition
of a druggable target. To develop a definition of a druggable genome, Hopkins and Groom
identified the molecular targets of rule-of-five compliant compounds. As put forth by
Lipinski et al., 1997, Adv. Drug Deliv. Rev. 23, 3, a rule-of-five compliant synthetic
compound (e.g., compounds other than those derived from natural products) has less than
five hydrogen-bond donors, the molecular mass of the compound is less than 500
Daltons, the lipophilicity is less than 5, and the sum of the nitrogen and oxygen atoms is
less than 10. A thorough review of the literature by Hopkins and Groom identified 399
non-redundant molecular targets that have been shown to bind rule-of-five compliant
compounds with binding affinities below 10 µM. Next, Hopkins and Groom took the
drug-binding domains of the 399 non-redundant molecular targets and determined the
families that they represent, as captured by their InterPro domain (Hopkins and Groom,
2002, Nature Reviews Drug Discovery 1, p. 727; Apweiler et al., 2001, Nucleic Acids Res.
29, 37). A total of 130 protein families represent the 399 non-redundant molecular targets.
These protein families are provided in the online supplemental information for Hopkins and
Groom, 2002, Nature Reviews Drug Discovery 1, p. 727 at
www.nature.com/reviews/drugdisc and include G-protein coupled receptors,
serine/threonine and tyrosine protein kinases, zinc metallo-peptidases, serine proteases,
nuclear hormone receptors and phosphodiesterases. Thus, in one embodiment of the
present invention, step 7920 comprises determining whether each cellular constituent in
the candidate causative cellular constituent set includes a druggable domain as defined by
Hopkins and Groom.
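The rule-of-five criteria recited above can be expressed as a simple predicate. The sketch below applies the four thresholds exactly as worded in this paragraph (strict "less than" comparisons); the function name, the parameter names, and the example property values are hypothetical.

```python
def is_rule_of_five_compliant(h_bond_donors, molecular_mass_da, lipophilicity, n_plus_o_atoms):
    """Apply the four thresholds as stated in the text above."""
    return (
        h_bond_donors < 5
        and molecular_mass_da < 500
        and lipophilicity < 5
        and n_plus_o_atoms < 10
    )


# Hypothetical compounds: a small polar molecule and a large lipophilic one.
small_molecule_ok = is_rule_of_five_compliant(1, 180.2, 1.2, 4)
large_molecule_ok = is_rule_of_five_compliant(6, 650.0, 5.5, 12)
```

Note that the predicate describes compounds, not targets; under the Hopkins and Groom definition a target is druggable when it belongs to a protein family known to bind such compliant compounds.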

Other methods for defining whether a given cellular constituent includes a
druggable domain are available and any such definition can be used in optional step 7920.
For example, in a comprehensive review of the accumulated portfolio of the
pharmaceutical industry, Drews, 1996, Nature Biotechnol. 14, 1516 and Drews and
Ryser, 1997, Nature Biotechnol. 15, 1318 identified 483 molecular targets and concluded
there could be 5,000-10,000 potential targets on the basis of an estimate of the number of
disease related genes. See, Drews, 2000, Science 287, 1960. Thus, in one embodiment of
the present invention, the molecular targets identified by Drews are considered the class
of cellular constituents that have a druggable domain. In still another embodiment of the
present invention, the class of cellular constituents that have a druggable domain are any
cellular constituents that are the molecular target of any drug product that has been
approved under section 505 of the United States Federal Food, Drug, and Cosmetic Act.

Step 7922. In optional step 7922, the cellular constituents in the candidate
causative cellular constituent set are ranked and filtered based on the rank assigned in step
7916 and/or the results of steps 7918 and 7920. A purpose of optional step 7922 is to
reduce the number of cellular constituents under consideration as molecular targets of a
therapeutic drug discovery program directed at alleviating the trait under study. As such,
optional ranking step 7922 serves to prioritize the cellular constituents and/or filter out
cellular constituents from the candidate causative cellular constituent set. In some
embodiments, for example, the only cellular constituents that are allowed to remain in the
candidate causal cellular constituent set are those cellular constituents that (i) are highly
ranked in step 7916, (ii) have the null hypothesis of causality accepted in step 7918 for all
their associated eQTL that overlap a trait cQTL, and, optionally, (iii) have a druggable
domain as determined by step 7920. In some representative embodiments, a high rank
means within the top 300, top 200, top 20%, or top 10% of the cellular constituents in the
candidate causal cellular constituent set.
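The three filtering criteria can be combined as in the following sketch. The record layout (`name`, `rank`, `causality_accepted`, `druggable`), the function name, and the default `top_fraction` are hypothetical choices for illustration; the specification leaves the cut-off open (top 300, top 200, top 20%, or top 10%).

```python
def filter_candidates(candidates, top_fraction=0.10, require_druggable=True):
    """Keep constituents that are highly ranked, causal for every eQTL, and optionally druggable.

    Each candidate is a dict with hypothetical keys:
      'name'               - identifier of the cellular constituent
      'rank'               - rank from step 7916 (1 = best)
      'causality_accepted' - list of booleans from step 7918, one per coincident eQTL
      'druggable'          - boolean from step 7920
    """
    cutoff = max(1, int(len(candidates) * top_fraction))
    kept = []
    for c in sorted(candidates, key=lambda c: c["rank"]):
        highly_ranked = c["rank"] <= cutoff
        causal = bool(c["causality_accepted"]) and all(c["causality_accepted"])
        druggable = c["druggable"] or not require_druggable
        if highly_ranked and causal and druggable:
            kept.append(c["name"])
    return kept


# Ten made-up candidates; only the first passes causality for all of its eQTL.
candidates = [
    {"name": f"c{i}", "rank": i,
     "causality_accepted": [True, True] if i != 2 else [True, False],
     "druggable": i % 2 == 1}
    for i in range(1, 11)
]
shortlist = filter_candidates(candidates, top_fraction=0.2)
```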

Step 7924. The preceding steps describe an analysis of a candidate causal cellular
constituent set in order to identify cellular constituents that are causal for a trait of
interest. However, the causality test of step 7918 can easily be rewritten to determine
whether (i) each eQTL, linked to a trait of interest T, and (ii) a cellular constituent in the
candidate causal cellular constituent set, are correlated conditional on the disease trait in
the plurality of organisms. Thus, in addition to determining whether a cellular constituent
is causal for a trait, the methods of the present invention can be used to determine
whether a cellular constituent is reactive to a trait of interest T. Further, the causality test
of step 7918 can easily be rewritten to determine whether (i) the trait of interest T, and (ii)
a cellular constituent in the candidate causal cellular constituent set are correlated
conditional on the QTL common to both traits. This last test determines whether a QTL
common to the trait of interest T and cellular constituent trait drives each of the traits
independently, so that the cellular constituent trait is neither causal nor reactive to the trait
T of interest. Information on which genes are causal and which genes are reactive for a
trait of interest can be used to reconstruct a genetic network using Bayesian analysis.

Section 5.19.5, below, outlines methods that can be used to validate the hypothesis
that certain cellular constituents are either causal or reactive to a trait of interest. Further,
multivariate analysis can be used to determine whether such cellular constituents act in
concert, in the form of a biological pathway, in order to affect the trait under study. In
one embodiment in accordance with the present invention, the degree to which each high
ranking cellular constituent makes up a candidate pathway group that affects the trait of
interest (or is affected by the trait of interest) is tested by fitting a multivariate statistical
model to the eQTL of the high ranking cellular constituents. Multivariate statistical
models have the capability to consider multiple quantitative traits simultaneously, model
epistatic interactions between the QTL and test other interesting variations that test
whether a group of cellular constituents belong to the same or related biological pathway.
Specific tests can be done to determine if the traits under consideration are actually
controlled by the same QTL (pleiotropic effects) or if they are independent.

Importantly, multivariate statistical analysis can be used to simultaneously
consider multiple traits. This is of use to determine whether the traits are genetically
linked to each other. Accordingly, in such embodiments, the eQTL of high ranking
cellular constituents can be subjected to multivariate statistical analysis in order to
determine whether the QTL are all genetically linked. Such an analysis can determine
that some of the QTL in the cluster found in the QTL interaction map are, in fact, linked
whereas other QTL in the cluster are not linked.

Multivariate statistical analysis can also be used to study the same trait from
multiple tissues. Multivariate statistical analysis of the same trait from multiple tissues
can be used to determine whether genetic linkage varies on a tissue specific basis. Such
techniques are of use, for example, in instances where a complex disease has a tissue
specific etiology. Exemplary multivariate statistical models that can be used in
accordance with the present invention are found in Section 5.6.


5.19.4. CAUSALITY TEST 
This section provides more details on the causality test that is applied in step 7818
of Fig. 78B. Let G be a gene expression trait for some gene g, and let T be a clinical
trait. For the correlation between G and T, it is of interest to determine those genetic
and environmental components driving the association, and it is of interest to determine
whether an assessment can be made in a genetics context as to whether one trait drives the
other. That is, does one of the relationships depicted in Fig. 84A hold?

It is not possible to look at these two traits in isolation and determine whether
either one of the cases depicted in Fig. 84A holds. In the more classical graphical
modeling context, where the aim is to reconstruct a complex network, different graphical
structures are assessed and edges are weighted and directed in such structures using
mutual information measures that examine all adjacent triplets (say, X, Y, and Z), where
these variables represent any combination of QTL, expression trait or clinical trait in the
graph where the topology of the graph is constrained a priori to satisfy certain
mathematical conditions.

Without the genetic information described herein this network reconstruction
problem is difficult because many of the different possibilities that are considered are not
distinguishable. For instance, consider the three possible relationships among three traits
of interest depicted in Fig. 84B. Cases (i) and (ii) are not distinguishable because they
have the same dependency structure. This presents problems for reliable reconstruction
of genetic networks given correlation data alone, since in many instances it will simply
not be possible to direct edges (directing the edges in such graphs establishes the cause
and effect relationships of interest to us in reconstructing pathways associated with
disease).

The embodiment of the invention outlined above, and illustrated in Fig. 78, has
the significant advantage in that gene expression data and clinical traits are linked to
quantitative trait loci (QTL). The QTL information provides a powerful filter that allows
for the rapid restriction of attention from all significantly correlated cellular constituents
and trait values to those subsets of cellular constituents and traits that are under the
control of a common set of QTL. The triplets described in Fig. 84B then become QTL
and traits and it is possible to initially direct an edge between the QTL and a single trait
by definition of a QTL, and then test all other traits pair wise as discussed below to
determine how the trait pairs are positioned relative to one another. For instance, going
back to the case of a clinical trait T linked to a QTL Q, the relationship between Q and T
can be immediately fixed as illustrated in Fig. 84C. The relationship in Fig. 84C holds
because Q is a QTL for T, and the QTL provides the direction of the relationship (T
depends on Q) since Q is causal for T (e.g., variations in the DNA at the QTL location
lead to variations in T). To position a given gene expression trait, G, that is correlated
with T, all that is required is a test for mutual independence of Q and T given G. That
is, if T and Q are independent given G, then the (Q, T, G) triplet has the form depicted
in Fig. 84D. However, lack of independence given G indicates one of the alternative
possibilities given by Fig. 84E.

The methods discussed below can be applied to determine which of the two
structures (Fig. 84D versus Fig. 84E) is supported by the data.




More formally, a determination of whether T is correlated with the genotypes at
Q, conditional on G, is desired in order to assess if the following property holds:

$$P(T, Q \mid G) = P(T \mid G)\,P(Q \mid G).$$

This property is satisfied only if T and Q are conditionally independent given G. For formal
theoretical support for this conditional independence property, see Pearl, 1988,
Probabilistic Reasoning In Intelligent Systems: Networks of Plausible Inference, Revised
Second Printing, Morgan Kaufmann Publishers, Inc., San Francisco, California, Section
3.1.2. This conditional independence property is related to the mutual information measure
that is typically used in network reconstruction problems:



$$I(T, Q \mid G) = \sum P(T, Q, G)\,\log\left(\frac{P(T, Q \mid G)}{P(T \mid G)\,P(Q \mid G)}\right),$$

where the summation symbol indicates the continuous variables T and G have been
discretized to allow for efficient computation over complicated graph structures, as is
usually done in network reconstruction problems. Mutual information is the
reduction in uncertainty about one variable due to the knowledge of the other variable.
See, for example, Duda et al., 2001, Pattern Classification, John Wiley & Sons, Inc.,
New York, p. 632.
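For discretized data, the conditional mutual information can be estimated from simple counts. The following is a minimal sketch under the assumption that T, Q and G are supplied as equal-length sequences of discrete labels; the function name is hypothetical and plug-in frequency estimates are used for every probability.

```python
import math
from collections import Counter


def conditional_mutual_information(t, q, g):
    """I(T;Q|G) in nats for three equal-length sequences of discrete labels."""
    n = len(t)
    n_tqg = Counter(zip(t, q, g))
    n_tg = Counter(zip(t, g))
    n_qg = Counter(zip(q, g))
    n_g = Counter(g)
    total = 0.0
    for (ti, qi, gi), count in n_tqg.items():
        # P(t,q|g) / (P(t|g) P(q|g)) reduces to n_tqg * n_g / (n_tg * n_qg).
        ratio = count * n_g[gi] / (n_tg[(ti, gi)] * n_qg[(qi, gi)])
        total += (count / n) * math.log(ratio)
    return total


# T and Q vary independently within each level of G: the measure is zero.
cmi_independent = conditional_mutual_information(
    [0, 0, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1], [0, 0, 0, 0, 1, 1, 1, 1])

# T copies Q while G is constant: conditioning on G removes nothing.
cmi_dependent = conditional_mutual_information(
    [0, 1, 0, 1], [0, 1, 0, 1], [0, 0, 0, 0])
```

A value of zero corresponds to the conditional independence property above; larger values indicate that Q still carries information about T after G is taken into account.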

While the mutual information measure is useful in more general network
reconstruction problems, the problem addressed by the instant causality test is
significantly simpler than the general case because of the novel requirement that T
and G are both linked to Q. This novel requirement leads to a more robust and more
powerful test for causality. The purpose of the causality test of the present invention is to
position a cellular constituent on the causal or reactive side of a clinical trait of interest,
which can be accomplished by testing for independence of T and Q, conditional on G,
as discussed above.

In developing a test for independence, a few observations help clarify the specifics
of such a test. First, it is assumed a priori that G and T are significantly correlated to Q.
That is, these quantitative traits both have QTL at position Q that give rise to significant
LOD scores. Second, it is noted that

$$P(T, Q \mid G) = P(T \mid Q, G)\,P(Q \mid G),$$

so that

$$P(T, Q \mid G) = P(T \mid G)\,P(Q \mid G)$$

if and only if

$$P(T \mid Q, G) = P(T \mid G),$$

whenever $P(Q \mid G) > 0$.

These relationships follow from the conditional independence of T and Q given G.
Therefore, the term $P(Q \mid G)$ can be ignored and the focus can center on the single
conditional probability. What this last equation implies is that if that portion of the
correlation between T and Q that can be explained by the correlation between G and Q
is conditioned out, then a determination can be made as to whether the remaining
correlation between T and Q is still significant. If not, then it is expected that a
significant QTL for $T \mid Q$ and $G \mid Q$ will arise, but that no significant QTL for $T \mid Q, G$
will arise. By forming the loglikelihood ratio based on these two probability densities,
the significance of the resulting LOD score can be used as the significance level for the
test of independence.

15 Before fonning the conditional likelihoods based on the conditional probability 

density fimctions discussed above, the likelihood for G and T for a single animal in an 
F2 population are formed, where G and T are taken to be jointly normally distributed, 
allowing for dependency between G and T . Under the null hypothesis of no correlation 
between (r, G) and genotypes at location Q, the likelihood for animal i is: 

20 l{BoA,gi) = r=^exp 



r 




1 


L 



where =(/i2.,/i^j,crj.,a^,yo) is the parameter vector for the likelihood, and p is the 

correlation between G and T . Under the alternative hypothesis where G and T are 
correlated with Q , the likelihood is: 



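$$l(\theta_1; t_i, g_i \mid Q) = \sum_j P(Q_j)\,\frac{1}{2\pi\sigma_T\sigma_G\sqrt{1-\rho^2}} \exp\left\{-\frac{z_{ij}}{2(1-\rho^2)}\right\},$$

where

$$z_{ij} = \left(\frac{t_i-\mu_{T_j}}{\sigma_T}\right)^2 - 2\rho\,\frac{(t_i-\mu_{T_j})(g_i-\mu_{G_j})}{\sigma_T\sigma_G} + \left(\frac{g_i-\mu_{G_j}}{\sigma_G}\right)^2$$

and $P(Q_j)$ is the probability of genotype $j$ at locus Q. Given these likelihoods for the
individual animals in an F2 population, the full likelihoods over all N animals for the null
and alternative hypotheses, respectively, are: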

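$$L(\theta_0; G, T) = \prod_{i=1}^{N} l(\theta_0; t_i, g_i)$$

and

$$L(\theta_1; G, T \mid Q) = \prod_{i=1}^{N} l(\theta_1; t_i, g_i \mid Q).$$

For each likelihood defined above, the maximum likelihood estimates for $\theta_0$ and $\theta_1$,
$\hat{\theta}_0$ and $\hat{\theta}_1$, are obtained. The likelihood ratio statistic is:

$$-2\log\left[\frac{L(\hat{\theta}_0; G, T)}{L(\hat{\theta}_1; G, T \mid Q)}\right],$$

which is $\chi^2$ distributed with four degrees of freedom.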
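The joint likelihood ratio test above can be illustrated with a minimal Python sketch. It is not the procedure of the present invention: it assumes, for simplicity, that the genotype of each animal at Q is observed directly at a marker, so that the mixture weights reduce to indicators and the maximum likelihood estimates have closed forms (group means and a shared residual covariance). The function names are hypothetical.

```python
import math

def bvn_loglik(pairs, mu_t, mu_g, var_t, var_g, cov_tg):
    """Sum of bivariate-normal log densities; each (t, g) pair has its own mean."""
    det = var_t * var_g - cov_tg ** 2
    total = 0.0
    for (t, g), mt, mg in zip(pairs, mu_t, mu_g):
        dt, dg = t - mt, g - mg
        # Quadratic form d' * inverse(Sigma) * d for a 2x2 covariance matrix.
        quad = (var_g * dt * dt - 2.0 * cov_tg * dt * dg + var_t * dg * dg) / det
        total += -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad
    return total

def fit_and_loglik(pairs, groups):
    """Closed-form MLE: one mean per group, shared covariance; returns max log-likelihood."""
    n = len(pairs)
    sums = {}
    for (t, g), k in zip(pairs, groups):
        s = sums.setdefault(k, [0.0, 0.0, 0])
        s[0] += t
        s[1] += g
        s[2] += 1
    mu_t = [sums[k][0] / sums[k][2] for k in groups]
    mu_g = [sums[k][1] / sums[k][2] for k in groups]
    var_t = sum((t - m) ** 2 for (t, _), m in zip(pairs, mu_t)) / n
    var_g = sum((g - m) ** 2 for (_, g), m in zip(pairs, mu_g)) / n
    cov = sum((t - a) * (g - b) for (t, g), a, b in zip(pairs, mu_t, mu_g)) / n
    return bvn_loglik(pairs, mu_t, mu_g, var_t, var_g, cov)

def joint_lr_statistic(pairs, genotypes):
    """-2 log(L0 / L1): null has one common mean, alternative has genotype-specific means."""
    ll_null = fit_and_loglik(pairs, [0] * len(pairs))
    ll_alt = fit_and_loglik(pairs, genotypes)
    return -2.0 * (ll_null - ll_alt)
```

With three genotype classes in an F2 population, the alternative adds two mean parameters for T and two for G, which is consistent with the four degrees of freedom stated above.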

With these maximum likelihood estimates in hand for the null and alternative
hypotheses, it is possible to compute the conditional likelihoods that are needed to assess
conditional independence of T and Q. The form of the conditional likelihood for $T \mid G$
(the conditional likelihood under the null hypothesis) for a single animal is:



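$$l(\theta_0; t_i \mid g_i) = \frac{1}{\sqrt{2\pi}\,\sigma_T\sqrt{1-\rho^2}} \exp\left\{-\frac{(t_i-b)^2}{2\sigma_T^2(1-\rho^2)}\right\},$$

where $b = \mu_T + \rho\,\frac{\sigma_T}{\sigma_G}(g_i - \mu_G)$. The corresponding conditional likelihood under the
alternative hypothesis is:

$$l(\theta_1; t_i \mid g_i, Q) = \sum_j P(Q_j)\,\frac{1}{\sqrt{2\pi}\,\sigma_T\sqrt{1-\rho^2}} \exp\left\{-\frac{(t_i-b_j)^2}{2\sigma_T^2(1-\rho^2)}\right\},$$

where $b_j = \mu_{T_j} + \rho\,\frac{\sigma_T}{\sigma_G}(g_i - \mu_{G_j})$.

The full likelihoods are:

$$L(\theta_0; T \mid G) = \prod_{i=1}^{N} l(\theta_0; t_i \mid g_i)$$

and

$$L(\theta_1; T \mid G, Q) = \prod_{i=1}^{N} l(\theta_1; t_i \mid g_i, Q).$$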
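The conditional likelihoods above rest on the standard factorization of a bivariate normal density into a marginal density for G and a conditional density for T given G. The following Python sketch, with hypothetical function names and illustrative parameter values, evaluates the conditional density under the null hypothesis and can be used to check that factorization numerically.

```python
import math

def normal_pdf(x, mean, sd):
    """Univariate normal density."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def conditional_t_given_g(t, g, mu_t, mu_g, sd_t, sd_g, rho):
    """Density of T given G = g for a bivariate normal (T, G)."""
    b = mu_t + rho * (sd_t / sd_g) * (g - mu_g)      # conditional mean
    sd_cond = sd_t * math.sqrt(1.0 - rho ** 2)       # conditional standard deviation
    return normal_pdf(t, b, sd_cond)

def joint_density(t, g, mu_t, mu_g, sd_t, sd_g, rho):
    """Bivariate normal density of (T, G) with correlation rho."""
    zt, zg = (t - mu_t) / sd_t, (g - mu_g) / sd_g
    quad = (zt * zt - 2.0 * rho * zt * zg + zg * zg) / (1.0 - rho ** 2)
    norm = 2.0 * math.pi * sd_t * sd_g * math.sqrt(1.0 - rho ** 2)
    return math.exp(-0.5 * quad) / norm
```

For any parameter values, `joint_density(t, g, ...)` should equal `conditional_t_given_g(t, g, ...)` multiplied by the marginal density of G, up to floating point error.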



Finally, from this, the conditional likelihood ratio test statistic of interest is obtained:

$$-2\log\left[\frac{L(\hat{\theta}_0; T \mid G)}{L(\hat{\theta}_1; T \mid G, Q)}\right],$$

where $\hat{\theta}_0$ and $\hat{\theta}_1$ are the maximum likelihood estimates obtained from the null and
alternative likelihoods $L_0$ and $L_1$ defined above.

5.19.5. TARGET VALIDATION

The methods of the present invention can be used to associate a cellular
constituent with a complex trait. This section discloses techniques that can be used to
validate such cellular constituents identified using the techniques of the present invention.
In some embodiments, gene knock-out / knock-in mice or transgenic mice are employed
for such validation. In some embodiments, in vivo siRNA is used to validate such genes.
See, for example, Cohen et al., 1997, J. Clin. Invest. 99, p. 1906; Xia et al., 2002, Nature
Biotechnology 20, p. 1006; Hannon, 2002, Nature 418, p. 244; Carthew, 2001, Current
Opinion in Cell Biology 13, p. 244; Paddison, 2002, Genes & Development 16, p. 948;
Paddison & Hannon, 2002, Cancer Cell 2, p. 17; Jang et al., 2002, Proceedings National
Academy of Science 99, p. 1984; and Martinez et al., 2002, Proceedings National
Academy of Science 99, p. 14849.

In some embodiments, before a putative target cellular constituent is biologically
validated in mice, association studies can be carried out in human populations to provide
a source of validation in humans. Associating a gene in a human population with a
clinical trait, where the gene in mouse 1) was physically co-localized with a cQTL for the
corresponding clinical trait in a segregating mouse population, 2) gave rise to a cis-acting
QTL with respect to its transcription, and 3) was significantly genetically interacting with
the clinical trait QTL, is itself a very powerful validation of a gene's role in the complex
trait of interest. See also United States Provisional Patent Application 60/436,684, filed
December 27, 2002. The combined validation in mouse and human provides all that is
necessary to move a target forward in a discovery program. Even in cases where the
causal gene is not itself druggable, druggable targets driven by the causal gene can be
identified by examining those targets that have eQTL that co-localize and are interacting
with eQTL for the causative gene. This speaks to the more general use of the combined
genetics/gene expression approach to reconstruct genetic networks.

5.20. DRUG DISCOVERY PARADIGM THAT INVOLVES THE
COMBINATION OF GENETIC, FUNCTIONAL GENOMIC AND CLINICAL
DATA

Novel techniques for associating genes with complex traits using cross species
data have been disclosed in the Sections above. This section describes a novel drug
discovery paradigm that uses cross species data to identify potential drug target
candidates for drug discovery programs. Further, this section describes techniques for
validating these potential drug target candidates. An illustration of a novel paradigm in
accordance with one embodiment of the present invention is disclosed in Fig. 69.

Step 6902. The drug discovery paradigm begins with step 6902, where a
therapeutic area or disease is selected. The paradigm is particularly useful for associating
genes with complex diseases such as those described in Section 5.12, above.

Step 6904. In step 6904, inbred strains that are discordant for the phenotype of
interest (e.g., the complex trait) are used to construct genetic crosses that are phenotyped
and genotyped. Further, tissues relevant to the disease selected in step 6902 are obtained
from the crossed progeny and the levels of a plurality of cellular constituents in these
tissues are measured. Representative forms of cellular constituents that can be measured
in step 6904 are described in Section 5.1. Step 6904 produces three forms of data for an
inbred population. They are phenotypic data, expression data and genotypic data.

Step 6906. In step 6906, a human population, or some other form of outbred
population, is identified and relevant tissues from family based samples with disease
related phenotypes are collected for expression profiling and for genome-wide
genotyping. Step 6906 produces three forms of data for an outbred population.
They are phenotypic data, expression data and genotypic data.

Step 6908. In step 6908, the individuals are profiled in order to identify a disease
associated pattern. In one approach, in accordance with step 6908, the population
observed in step 6904 (or step 6906) is stratified based on a clinical trait that is relevant to
the disease selected in step 6902. Then, the upper and lower extremes of this stratified
population are considered. Specifically, those genes that are the most differentially
expressed in the upper and lower extremes of the stratified population are selected. This
set of genes can be considered the most transcriptionally active set of genes for the
population falling in the tails of the clinical trait distribution. This set can be termed the
"active set". The selection of the active set is not biased by selecting genes based on their
ability to discriminate between the clinical trait extremes.
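One possible reading of the active set selection described above is sketched below in Python. The tail fraction, the number of genes retained, and the use of a simple difference in mean expression between the two tails are all assumptions made for illustration; the text does not fix a particular measure of differential expression.

```python
def select_active_set(expression, trait, tail_fraction=0.25, n_genes=2):
    """Pick the genes most differentially expressed between the trait extremes.

    expression: {gene name: [expression value per organism]}
    trait:      [clinical trait value per organism], same organism order
    """
    # Stratify organisms by the clinical trait and take the two tails.
    order = sorted(range(len(trait)), key=lambda i: trait[i])
    k = max(1, int(len(order) * tail_fraction))
    low, high = order[:k], order[-k:]

    def mean_over(indices, values):
        return sum(values[i] for i in indices) / len(indices)

    # Score each gene by the absolute difference in mean expression between tails.
    scores = {
        gene: abs(mean_over(high, values) - mean_over(low, values))
        for gene, values in expression.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n_genes]
```

In practice a measure that accounts for measurement error (for example, a p-value from an error model) would likely replace the raw difference in means used here.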

The active set is used to help define the clinical trait under study. Expression
vectors for each of the genes in the active set are constructed. Each expression vector
includes the expression value of a given gene in the active set across the organisms in the
identified population. Then, the expression vectors are subjected to two-dimensional
cluster analysis. On the first axis (e.g., the x-axis), the expression vectors for each of the
genes in the active set are clustered. To form the clustering on the other axis (e.g., the
y-axis), an organism vector is constructed for each of the organisms in the population. Each
such organism vector includes the expression value for each of the genes in the active set.
The organism vectors are clustered along the y-axis. Thus, the first axis clusters genes
that express similarly across the population and the second axis clusters organisms that
have similar gene expression values for the active set. Each x,y coordinate in the
two-dimensional graph represents a cellular constituent level for a gene in a given organism.
In some embodiments, each x,y coordinate in the two-dimensional graph is color coded to
indicate the expression level of the gene in the given organism relative to a reference
pool. An example of this form of two-dimensional cluster analysis is provided in Section
5.19.
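The two-dimensional cluster analysis described above can be sketched in Python as follows. The choice of average-linkage agglomerative clustering on one minus the Pearson correlation is an assumption made for illustration, as are the function names; any of the clustering techniques referenced elsewhere in this document could be substituted.

```python
import math

def pearson(u, v):
    """Pearson correlation between two equal-length vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den if den else 0.0

def cluster(vectors, n_clusters):
    """Average-linkage agglomerative clustering on the distance 1 - Pearson correlation."""
    clusters = [[name] for name in vectors]

    def dist(a, b):
        pairs = [(x, y) for x in a for y in b]
        return sum(1.0 - pearson(vectors[x], vectors[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # Merge the closest pair of clusters (j > i, so popping j leaves i in place).
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)
    return clusters

def organism_vectors(expression, organisms):
    """Transpose gene expression vectors into one vector per organism."""
    return {
        org: [values[i] for values in expression.values()]
        for i, org in enumerate(organisms)
    }
```

Calling `cluster(expression, k)` groups genes (the first axis), and calling `cluster(organism_vectors(expression, organisms), k)` groups organisms (the second axis).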

The two-dimensional cluster analysis allows for the determination of subgroups in
the population. Clearly such subpopulations will be defined by clusters on the second
axis (e.g., y-axis). However, the patterns produced by the clustering on the first axis aid
in defining the subpopulations on the second axis. Namely, each subgroup on the second
axis should have similar patterns of expression across the active set. In Fig. 59, for
example, the y-axis was not clustered based on a clinical trait. Nevertheless, the mice on
the y-axis cluster into distinct phenotypic groups. The first set is the low fat pad mass
group. The low fat pad mass group is defined by two factors. First, the low fat pad mass
group defines a cluster on the y-axis. Second, genes in the low fat pad mass group that are
in set 5902 tend to be green-shifted relative to the reference pool whereas genes in set 804
tend to be red-shifted relative to the reference pool. The expression pattern of the genes
in the 280 member set along the y-axis serves to validate that the low fat pad mass group is
not, in fact, a composite of two or more subgroups. Continuing with this form of
analysis, two other groups (high fat pad mass 1 and high fat pad mass 2) are defined on
the y-axis and validated by the pattern of expression along the y-axis as summarized in
the following table:

Name          Y-axis          X-axis - gene set 802    X-axis - gene set 804

Low FPM       Cluster 5910    Green                    Red
High FPM 2    Cluster 5912    Green                    Red
High FPM 1    Cluster 5914    Green/red                Green



Step 6910. In step 6910, the underlying genetics of genes involved in the patterns
identified in step 6908 are evaluated. For example, the groups of organisms identified in
the analysis above (e.g., low FPM organisms, high FPM 1 organisms, high FPM 2
organisms) can be subjected to independent quantitative genetic analysis. For example,
1) those classified as high FPM group 1 or low FPM, and 2) those classified as high FPM
group 2 or low FPM can each independently be subjected to quantitative analysis. In this
quantitative analysis, the phenotypic trait associated with the disease selected in step 6902
(e.g., FPM) is analyzed using the subpopulations identified in the two-dimensional cluster
analysis rather than the whole population. Several forms of quantitative genetic analysis
are possible. First, a clinical trait can be used to derive clinical QTL (cQTL). Second,
expression values for the genes can be used to derive eQTL.

Regardless of whether clinical or expression values serve as quantitative traits,
the exact forms of quantitative genetic analysis used in step 6910 will depend on the type
of phenotypic data that is available. Inbred populations (or subpopulations) can be
profiled using linkage analysis as described in Section 5.13 or any of the techniques
described in Chapter 15 of Lynch and Walsh, 1998, Genetics and Analysis of Quantitative
Traits, Sinauer Associates, Inc., Sunderland, MA. Outbred populations (or
subpopulations) can be profiled using association analysis techniques described in Section
5.14 or any of the techniques described in Chapter 16 of Lynch and Walsh, 1998,
Genetics and Analysis of Quantitative Traits, Sinauer Associates, Inc., Sunderland, MA.
The purpose of step 6910 is to identify cQTL and eQTL that are associated with the
disease selected in step 6902.

Step 6912. In step 6912, the genetics of the disease associated pattern is
intersected with the genetics of disease related traits to identify key drivers. Steps 6902
through 6910 serve to identify patterns of expression associated with a clinical trait
related to the disease under study. The quantitative genetic analyses identify genetic loci
(eQTL and/or cQTL) that control subtypes of the disease. However, such techniques do
not, by themselves, lead to identification of the underlying QTL of interest. Genes
underlying QTL controlling for a clinical trait can cause variation in the trait through
polymorphic transcription due to DNA polymorphisms. Such genes can be identified by
combining genetics with gene expression.

In one approach in accordance with step 6912, eQTL that colocalize with cQTL
and with the physical location of the gene whose transcription gives rise to the eQTL are
identified. In cases where the gene underlying a QTL for a clinical trait controls the
variation of that trait through variation in transcription associated with DNA
polymorphisms in the gene itself, the expression of that gene treated as a quantitative trait
should give rise to an eQTL coincident with the cQTL. Depending on the degree of
heritability of the clinical and expression traits, and the percentage of variation of the trait
explained by the cQTL, it is not expected that the clinical trait values and the expression
values will be significantly correlated, even if variation in transcription of the gene causes
variation in the clinical trait. However, significant genetic correlation between the
clinical trait QTL and gene expression traits is expected in such cases. Therefore, testing
for interaction between the clinical trait QTL and the gene expression QTL can identify
candidate genes underlying the cQTL for the clinical trait of interest.
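The colocalization criterion described above can be sketched in Python as a simple filter. The data structure, the 10 cM window, and the LOD threshold of 4.3 are hypothetical values chosen for illustration, and gene positions are assumed to be available on the same genetic map as the QTL. The sketch does not implement the test for genetic interaction, which is a separate statistical step.

```python
from dataclasses import dataclass

@dataclass
class QTL:
    trait: str      # gene name for an eQTL, clinical trait name for a cQTL
    chrom: str
    pos_cm: float   # position on the genetic map, in centimorgans
    lod: float

def cis_eqtl_at_cqtl(eqtls, cqtls, gene_pos, window_cm=10.0, min_lod=4.3):
    """Return (gene, clinical trait) pairs where a cis-acting eQTL colocalizes with a cQTL.

    gene_pos: {gene name: (chromosome, position in cM)} giving each gene's physical location.
    """
    hits = []
    for e in eqtls:
        loc = gene_pos.get(e.trait)
        if loc is None or e.lod < min_lod:
            continue
        # cis-acting: the eQTL maps to the location of the gene itself.
        is_cis = loc[0] == e.chrom and abs(loc[1] - e.pos_cm) <= window_cm
        for c in cqtls:
            near_cqtl = c.chrom == e.chrom and abs(c.pos_cm - e.pos_cm) <= window_cm
            if is_cis and near_cqtl and c.lod >= min_lod:
                hits.append((e.trait, c.trait))
    return hits
```

Genes returned by such a filter would be candidates only; the interaction test described above would still be needed before treating any of them as underlying the cQTL.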

Techniques that simultaneously analyze multiple QTLs can be used to identify
whether eQTL and cQTL are linked. Such techniques include marker-difference
regression (also known as marker regression or joint mapping). See, for example,
Kearsey and Hyne, 1994, Theor. Appl. Genet. 89, p. 698; Wu and Li, 1994, Theor. Appl.
Genet. 89, p. 535. Such techniques further include interval mapping with marker
cofactors. See, for example, Jansen, 1992, Theor. Appl. Genet. 85, p. 252; Jansen, 1993,
Genetics 135, p. 205; Zeng, 1993, Proc. Natl. Acad. Sci. USA 90, p. 10972; Zeng, 1994,
Genetics 136, p. 1457; Stam, 1991, Proceedings of the Eighth Meeting of the Eucarpia
Section Biometrics on Plant Breeding, Brno, Czechoslovakia, pp. 24-32; Jansen, 1995,
Theor. Appl. Genet. 91, p. 33; van Ooijen, 1994, in van Ooijen and Jansen (eds.),
Biometrics in plant breeding: applications of molecular markers, pp. 205-212,
CPRO-DLO, Netherlands; and Utz and Melchinger, 1994, in van Ooijen and Jansen (eds.),
Biometrics in plant breeding: applications of molecular markers, pp. 195-204,
CPRO-DLO, Netherlands. Such techniques further include multiple-trait extensions to
composite interval mapping given by Jiang and Zeng. See, for example, Jiang and Zeng,
1995, Genetics 140, p. 1111; and Ronin et al., 1995, Theor. Appl. Genet. 90, p. 776.

Step 6914. In step 6914, each gene identified in step 6912 is validated using an
association analysis in an independent population. The type of association analyses used
in step 6914 can be, for example, any of the various forms of association analyses
described in Section 5.14, above. Further, any of the techniques described in Chapter 16
of Lynch and Walsh, 1998, Genetics and Analysis of Quantitative Traits, Sinauer
Associates, Sunderland, MA, can be used in step 6914.

Step 6916. In step 6916, each gene identified in step 6912 is validated using
advanced crosses, congenic strains or similar modeling systems. Congenics are useful for
validations in step 6916. Once a QTL is identified for the trait of interest, the strain
whose congenic region covers the QTL region can be identified and studied with respect
to the same phenotype. Further, more complicated genetic models can be constructed
using the congenics, based on QTL results from, for example, an F2 cross. For example,
suppose two strongly interacting QTL were identified from the F2 cross. The congenic
strains covering the two QTL regions could be bred to construct a new congenic strain
that had two congenic regions, each covering one of the QTL of interest. These mice
could then be studied with respect to the phenotype of interest. The advantage to this sort
of construction is that the congenic strains are stable and can be constantly bred to
generate progeny that are genetically identical (unlike the F2 populations, where there is
no hope of recovering the same genetic background).

Step 6918. In step 6918, synteny (comparative mapping) can be used to provide
an informed selection of targets. For example, putative candidates identified (and
possibly validated) in one species using the steps described above can be mapped to
orthologs in another species using a comparative genetic map between the two species.
Then a determination can be made as to whether the region in the second species has been
associated with the disease in the second species.

This strategy was employed by Schadt et al., 2003, Nature 422, p. 297 to increase
confidence in mouse chromosome QTL tentatively associated with obesity. The mouse
locus was homologous to human chromosome 20q12-q13.12, a region that has previously
been linked to human obesity-related phenotypes. The human orthologs for the
candidates identified in the mouse also reside in the human chromosome 20 region.

Step 6920. Not all genes linked to a disease serve as druggable targets. For
example, genes that encode for proteins such as transcription factors may not be ideal
targets for a drug discovery program even when their linkage to a disease of interest has
been validated. Thus, step 6920 serves to analyze genes that have been linked to a
disease of interest to determine if they are suitable targets for drug discovery.

Step 6922. In step 6922, gene targets are further validated by techniques such as
gene knock-out / knock-in mice, transgenic mice, or RNAi techniques. Figure 70
provides a hypothetical example of a validation strategy in accordance with one
embodiment of the present invention. In this example, genes Y1 through Y4 are genes
that are part of an expression pattern associated with a complex trait of interest. The
upper panel plots the lod score curves for the four genes for a particular chromosome,
where the cluster of eQTL depicted are coincident with a cQTL for the complex trait. By
examining genes that physically reside in the QTL support interval, those genes that have
cis-acting eQTL that are significantly genetically interacting with the other eQTL/cQTL
are identified. These genes represent the potential causative genes underlying the
cQTL/eQTL. Gene X in Fig. 70 highlights one such example. By knocking gene X out
using in vivo small interfering RNA (siRNA) methods, the siRNA knock-out animals can
be profiled and the genetic signatures of the original genes making up the eQTL cluster
examined. Various siRNA knock-out techniques (also referred to as RNA interference or
post-transcriptional gene silencing) are disclosed, for example, in Xia et al., 2002, Nature
Biotechnology 20, p. 1006; Hannon, 2002, Nature 418, p. 244; Carthew, 2001, Current
Opinion in Cell Biology 13, p. 244; Paddison, 2002, Genes & Development 16, p. 948;
Paddison & Hannon, 2002, Cancer Cell 2, p. 17; Jang et al., 2002, Proceedings National
Academy of Science 99, p. 1984; and Martinez et al., 2002, Proceedings National Academy
of Science 99, p. 14849.

The lower panel in Fig. 70 highlights what is expected if gene X were in fact
driving the eQTL cluster shown in the upper panel. That is, the disappearance of the
eQTL cluster would validate gene X's role as the causal factor underlying the expression
pattern associated with the complex trait, and thus, would solidify its role as a key driver
for the corresponding complex trait. If the complex trait were a disease like obesity, then
validating a gene for the obesity trait directly would require the construction of, say, a
knock out animal for that gene, which is a lengthy process. However, by defining the
complex trait in terms of expression patterns, the candidate gene can be perturbed in more
specialized ways and the effects on the expression pattern observed, which can happen in
a much shorter time frame.



5.21. TEST FOR PLEIOTROPY 

In some embodiments, a determination is made as to whether the coincidence
between an eQTL and a respective cQTL arises through pleiotropy or close linkage
between QTL. When a determination is made that the coincidence between an eQTL and
a respective cQTL is the result of two closely linked QTL, association between the
cellular constituent corresponding to the eQTL and the trait corresponding to the cQTL is
not made. In some embodiments, a test for pleiotropy comprises comparing a model for
the null hypothesis, indicating the result of pleiotropy, to a model for the alternative
hypothesis, indicating two closely linked QTL.

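In some embodiments, the model for the null hypothesis is:

$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} + \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} Q + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix}$$

where

Q is a categorical random variable indicating the genotypes at the position of the
eQTL and the cQTL in the plurality of organisms;

$(\varepsilon_1, \varepsilon_2)^T$ is distributed as a bivariate normal random variable with mean
$(0, 0)^T$ and covariance matrix
$\begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}$; and

$\mu_i$ and $\beta_i$ are model parameters.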



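In some embodiments, the model for the alternative hypothesis is:

$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} + \begin{pmatrix} \beta_1 & 0 \\ 0 & \beta_2 \end{pmatrix} \begin{pmatrix} Q_1 \\ Q_2 \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix}$$

where

$Q_1$ and $Q_2$ are categorical random variables indicating the genotypes at the
position of the eQTL and the cQTL in the plurality of organisms;

$(\varepsilon_1, \varepsilon_2)^T$ is distributed as a bivariate normal random variable with mean
$(0, 0)^T$ and covariance matrix
$\begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}$; and

$\mu_i$ and $\beta_i$ are model parameters.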



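In some embodiments the model for the alternative hypothesis is:

$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} + \begin{pmatrix} \beta_1 & \beta_2 \\ \beta_3 & \beta_4 \end{pmatrix} \begin{pmatrix} Q_1 \\ Q_2 \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix}$$

wherein

$Q_1$ and $Q_2$ are categorical random variables indicating the genotypes at the position
of the eQTL and the cQTL in the plurality of organisms;

$(\varepsilon_1, \varepsilon_2)^T$ is distributed as a bivariate normal random variable with mean
$(0, 0)^T$ and covariance matrix
$\begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}$;

$\mu_i$ and $\beta_i$ are model parameters; and

one of the conditions (i) through (iv) is valid:

(i) $\beta_1 \neq 0$, $\beta_4 \neq 0$, $\beta_2 = 0$, and $\beta_3 = 0$;

(ii) $\beta_1 \neq 0$, $\beta_4 \neq 0$, $\beta_2 \neq 0$, and $\beta_3 = 0$;

(iii) $\beta_1 \neq 0$, $\beta_4 \neq 0$, $\beta_2 = 0$, and $\beta_3 \neq 0$; and

(iv) $\beta_1 \neq 0$, $\beta_4 \neq 0$, $\beta_2 \neq 0$, and $\beta_3 \neq 0$.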



In some embodiments the loglikelihoods for the null hypothesis and the alternative
hypothesis are maximized with respect to the model parameters ($\mu_i$, $\beta_i$, and $\sigma_i^2$) using
maximum likelihood analysis. After maximum likelihood estimates are obtained for each
model, the likelihood ratio test statistic between the competing models is formed and the
test statistic is used to determine whether the model for the alternative hypothesis
provides for a statistically significant better fit to the data than the model for the null
hypothesis.
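As a hedged restatement of the comparison just described, the likelihood ratio test statistic between the two fitted models can be written in the usual form below. The symbols $\Lambda$, $\ell_0$ and $\ell_1$ are introduced here for illustration only and denote, respectively, the test statistic and the maximized loglikelihoods of the null (pleiotropy) and alternative (closely linked QTL) models.

```latex
\Lambda = -2\left[\,\ell_0(\hat{\theta}_0) - \ell_1(\hat{\theta}_1)\,\right]
```

A large value of $\Lambda$ indicates that the model with two closely linked QTL fits the data better than the pleiotropy model. The reference distribution used to judge significance is not specified in this section.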



6. EXAMPLES 

The following examples are presented by way of illustration of the invention and
are not limiting.

6.1. EXEMPLARY SOURCES OF GENOTYPE AND PEDIGREE DATA
Mice. The methods of the present invention are applicable to any living organism
in which genetic variation can be tracked. Therefore, by way of example, genotype
and/or pedigree data 68 (Fig. 1) is obtained from experimental crosses or a human
population in which genotyping information and relevant clinical trait information is
provided. One such experimental design for a mouse model for complex human diseases
is given in Fig. 5. In Fig. 5, there are two parental inbred lines that are crossed to obtain
an F1 generation. The F1 generation is intercrossed to obtain an F2 generation. At this
point, the F2 population is genotyped and physiologic phenotypes for each F2 in the
population are determined to yield genotype and pedigree data 68. These same
determinations are made for the parents as well as a sampling of the F1 population.

Human populations. The present invention is not constrained to model systems,
but can be applied directly to human populations. For example, pedigree and other
genotype information for the CEPH family is publicly available (Center for Medical
Genetics, Marshfield, Wisconsin), and lymphoblastoid cell lines from individuals in these
families can be purchased from the Coriell Institute for Medical Research (Camden, New
Jersey) and used in the expression profiling experiments of the instant invention. The
plant, mouse, and human populations discussed in this Section represent non-limiting
examples of genotype and/or pedigree data for use in the present invention.

6.2. IDENTIFICATION OF REGIONS THAT BROADLY CONTROL
TRANSCRIPTION

The genome-wide consideration of all genes as quantitative traits, representation
of individual QTL analysis results in a database, and summarizing the degree of overlap
among all genes at all positions where a QTL analysis was run enables the identification
of regions that very broadly control transcription. For a given organism, this allows for
the identification of regions that potentially control for basal-level transcription levels
across most genes that are expressed. An important utility that is provided by the
methods of the present invention is the identification of those genes that control
biological pathways and / or interactions between biological pathways as well as the
separation of these genes from genes that are simply responding to the signals propagated
by the potentially small set of genes.

Some approaches seek genes that have significantly co-regulated expression
patterns over a number of relevant conditions. Many forms of cluster analysis and other
pattern detection schemes are used to uncover such patterns. Then, techniques such as
multivariate analysis are used to determine whether these co-regulated genes participate
in the same biological pathway (e.g., whether these genes genetically interact or control
each other). That is, multivariate techniques are used to determine whether such genes
are trans acting. However, most strongly genetically controlled genes are actually the
least similar, least co-regulated with respect to other genes because their expression
patterns are independent of the expression patterns of other genes. Therefore, it is
expected that trans acting genes (e.g., genes acting on other genes to affect gene
transcription) are harder to detect than cis acting genes. An example of a cis acting gene
is a gene in which variation within the gene affects transcription of the gene itself. The
methods of the present invention allow for the identification of trans acting genes. The
identity of trans acting genes further elucidates control of pathways and disease etiology
since they are ostensibly important to the proper functioning of so many pathways.

6.3. IDENTIFYING GENES UNDER GENETIC CONTROL IN SMALL
POPULATIONS

In this example, 56 individuals from four CEPH reference families (Dausset, 1990,
Genomics 6:575-577) were selected for expression profiling of lymphoblastoid cell lines
using a standard 25K human gene oligonucleotide microarray. The 25K human gene
oligonucleotide microarray is described in van't Veer et al., Nature 415, 530-536 as well
as Hughes et al., 2001, Nat. Biotechnol. 19, 342-347. Briefly, labeled cRNAs were
fragmented to an average size of approximately 50-100 nucleotides by heating at 60°C in
the presence of 10 mM ZnCl2, added to hybridization buffer containing 1 M NaCl, 0.5%
sodium sarcosine, 50 mM MES, pH 6.5, and formamide to a final concentration of 30%,
final volume 3 ml at 40°C. The 25K human gene oligonucleotide microarray represents
24,479 biological oligonucleotides plus 1,281 control probes.

The four families, CEPH/Utah pedigrees 1362, 1375, 1377 and 1408, consisted of
large sibships along with parents and grandparents. These CEPH families have served as
an important scientific resource for polymorphism discovery and human genetic map
construction. Hence, extensive genotype data is publicly available for these families.

Lymphoblastoid cell lines from CEPH/Utah pedigree families 1362, 1375, 1377
and 1408 were obtained from Coriell Cell Repositories, Camden, NJ. Other
lymphoblastoid cell lines were established from normal donors by immortalization with
Epstein-Barr Virus (EBV) as described by Tosato, Generation of Epstein-Barr Virus
(EBV)-immortalized B cell lines, Current Protocols in Immunology 1, 7.22.1-7.22.3, John
Wiley & Sons, New York, 1991. Cells were cultured in RPMI 1640 medium containing
15% fetal bovine serum, and penicillin/streptomycin antibiotics (Invitrogen Life
Technologies, Carlsbad, CA). Cells were maintained in the log phase of cell growth for
at least two days and were harvested at densities of 0.4-0.9 x 10^6 cells/ml. Total cellular
RNA was then purified using an RNeasy Mini kit according to the manufacturer's
instructions (Qiagen, Valencia, CA). Competitive hybridizations were performed by
mixing fluorescently labeled cRNA (5 µg) from each CEPH/Utah lymphoblastoid line
with the same amount of cRNA from a reference pool, comprising equal amounts of
cRNA from lymphoblastoid lines established from seven unrelated normal blood donors.
The human microarray contained 24,479 non-control oligonucleotide probes for human
genes. The hybridizations were performed in duplicate with fluor reversal.

Array images were processed to obtain background noise, single channel intensity,
and associated measurement error estimates. Expression changes between two samples
were quantified as log10 (expression ratio) where the 'expression ratio' was taken to be
the ratio between normalized, background-corrected intensity values for the two channels
(red and green) for each spot on the array. An error model for the log ratio was applied to
quantify the significance of expression changes between two samples. See Roberts et al.,
2000, "Signaling and Circuitry of Multiple MAPK Pathways Revealed by a Matrix of
Global Gene Expression Profiles," Science 287, 873-880.
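The log10 expression ratio described above, together with the duplicate hybridization with fluor reversal, can be sketched in Python as follows. The function names are hypothetical, the intensities are assumed to be already normalized and background-corrected, and simple averaging of the two orientations is an assumption; the error model referenced above is not reproduced here.

```python
import math

def log10_ratio(sample_intensity, reference_intensity):
    """log10 of the sample-to-reference intensity ratio for one spot."""
    return math.log10(sample_intensity / reference_intensity)

def combine_fluor_reversed(pair_a, pair_b):
    """Average the log10 ratios from a fluor-reversed pair of hybridizations.

    pair_a: (red, green) intensities with the sample labeled in the red channel.
    pair_b: (red, green) intensities with the sample labeled in the green channel.
    """
    ratio_a = log10_ratio(pair_a[0], pair_a[1])
    ratio_b = log10_ratio(pair_b[1], pair_b[0])
    return (ratio_a + ratio_b) / 2.0
```

One reason for performing the hybridization in both orientations is that a multiplicative bias in one channel enters the two log ratios with opposite signs and cancels in the average.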

Genotype data for the four CEPH families was obtained from the CEPH Genotype
database (Murray et al., 1994, Science 265, 2049-2054). A total of 495 autosomal STR
polymorphisms were selected for analysis. Polymorphisms were chosen so that
genotypes were available for all but three or fewer individuals per pedigree with this
condition being true in at least three of the pedigrees. Marker positions were assigned
using a Marshfield sex-averaged genetic map (Broman et al., 1998, Am. J. Hum. Genet.
63, 861-869). Variance-components analysis (Amos, 1994, Am. J. Hum. Genet. 54,
535-543) was used to estimate the heritability of gene expression, as measured by the mean
log10 expression ratio, for each of the 2,726 mRNA that were significantly differentially
expressed in the founders, and to test whether the heritability was significantly different
from zero. Genes were defined as differentially regulated if eight or more founders had a
p-value for differential expression less than 0.05. Heritability estimates were obtained by
maximizing the likelihood assuming a multivariate normal distribution for the vector of
phenotypes for the pedigree. The null hypothesis of no heritability was tested by
comparing the full model, which assumes genetic variation, and a reduced model, which
assumes no genetic variation, using a likelihood ratio test. The above analyses were
repeated allowing for a shared household effect. All analyses were performed using
procedures contained in the Sequential Oligogenic Linkage Analysis Routines (SOLAR)
package (Almasy and Blangero, Am. J. Hum. Genet. 62, 1198-1211, 1998).
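The rule used above to define differentially regulated genes (eight or more founders with a p-value for differential expression below 0.05) can be expressed as a short Python filter. The function and argument names are hypothetical; the per-founder p-values are assumed to come from the error model for the log ratio.

```python
def differentially_regulated(pvalues_by_gene, alpha=0.05, min_founders=8):
    """Return genes with p < alpha in at least min_founders founders.

    pvalues_by_gene: {gene name: [p-value for differential expression, one per founder]}
    """
    return [
        gene
        for gene, pvals in pvalues_by_gene.items()
        if sum(p < alpha for p in pvals) >= min_founders
    ]
```

Only genes passing this filter would then be carried forward to the variance-components heritability analysis, which is not sketched here.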

As described above, heritability analysis was performed for gene expression on a
subset of 2,726 genes that were significantly differentially regulated within 8 or more of
the 16 pedigree founders. Due to the relatively small population size, systematic linkage
analysis across all genes was not performed. As indicated in Fig. 6, for the differentially
expressed genes, 29% had a detectable genetic component (Type I error < 0.05). This
result offers a striking glimpse into the genetics of gene expression in humans, with such
a large percentage of genes detected with significant heritabilities in such a small sample
of "normal" individuals. The group of genes having a detectable genetic component
makes good targets for complex human diseases, given the degree of genetic control in
these genes is so readily identifiable in this small population. A closer look at many of
the genes with most significant heritabilities shows that many have already been
implicated in human complex diseases: 1) Coagulation Factor XIII, associated with
thrombosis (Franco et al., 1999, Thromb. Haemost. 81, 676-679), 2) Vitamin D Receptor,
associated with osteoporosis (Ralston, 2002, J. Clin. Endocrinol. Metab. 87, 2460-2466),
3) BCAR1, potentially associated with resistance to breast cancer treatment (Brinkman et
al., 2000, J. Natl. Cancer Inst. 92, 112-120), 4) Glycophorin C, associated with red blood
cell ovalocytosis and malaria resistance (Mgone et al., 1996, Trans. R. Soc. Trop. Med. Hyg.
90, 228-231), 5) Catenin, expressed in colon cancer (Morin et al., 1997, Science 275,
1787-1790), and 6) Cubilin, null mutations of which have been associated with hereditary
megaloblastic anemia (Aminoff et al., 1999, Nat. Genet. 21, 309-313).

6.4. GENETIC ANALYSIS OF THE MOUSE TRANSCRIPTOME 
The following example illustrates how the methods of the present invention
uncover significant patterns of gene interactions. In particular, the example demonstrates
how QTL that are linked to quantitative traits (e.g., expression statistic sets 304) cluster to
specific loci. As defined previously, a QTL is a region of any genome that is responsible
for variation of a quantitative trait. A QTL that is linked to a given expression statistic set
304 is referred to as an "expression QTL" or "eQTL". Further, the example illustrates
how quantitative trait locus analyses can detect several types of transcript abundance
polymorphisms, such as differential transcript decay, differential dosing, differential
splicing, and differential transcription rate. As such, this example illustrates the type of
information that can be obtained by performing steps 202 through 210 of Fig. 2.

An F2 intercross was constructed from C57BL/6J and DBA/2J strains of mice.
All mice were housed under conditions meeting the guidelines of the Association for
Accreditation of Laboratory Animal Care. Mice were on a rodent chow diet up to 12
months of age, and then switched to an atherogenic high-fat, high-cholesterol diet for
another four months. See, for example, Drake et al., 2001, Physiol Genomics 5, 205-15,
which is hereby incorporated by reference in its entirety. Parental and F2 mice were
sacrificed at sixteen months of age. At death the livers were immediately removed,
flash-frozen in liquid nitrogen and stored at -80°C. Total cellular RNA was purified from
25 mg portions using an RNeasy Mini kit according to the manufacturer's instructions
(Qiagen, Valencia, CA). Competitive hybridizations were performed by mixing
fluorescently labeled cRNA (5 mg) from each of 111 F2 liver samples, 5 DBA/2J liver
samples, and 5 C57BL/6J liver samples, with the same amount of cRNA from a reference
pool comprised of equal amounts of cRNA from each of the 111 liver samples profiled.

Liver tissues from the 111 F2 mice constructed from two standard inbred strains
of mice, C57BL/6J and DBA/2J, were profiled using a 25K mouse gene oligonucleotide
microarray. The hybridizations were performed in duplicate using fluor reversal. The
mouse microarray contained 23,574 non-control oligonucleotide probes for mouse genes
and 2,186 control oligos. Full-length mouse sequences were extracted from Unigene
clusters, build # 91 (Schuler et al., 1996, Science 274, 540-546), and combined with
RefSeq mouse sequences from June 2001 (Pruitt and Maglott, 2001, Nucleic Acids
Research 29, 137-140), and RIKEN full-length sequences, version fantom 1.01 (Kawai et
al., 2001, Nature 409, 685-690). This collection of full-length sequences was
clustered and one representative sequence per cluster was selected, resulting in 18,597
full-length mouse sequences. To complete the array, 3' ESTs were selected from
Unigene clusters that did not cluster with any full-length sequence from Unigene, RefSeq,
or RIKEN. To further down-select ESTs, 3' ESTs that had significant homology to
human genes were chosen, resulting in 4,977 3' mouse ESTs with human homology. To
select a probe for each gene sequence, a series of filtering steps was used, taking into
account repeat sequences, binding energies, base composition, distance from the 3' end,
sequence complexity, and potential cross-hybridization interactions (Hughes et al., 2001,
Nat Biotechnol. 19, 342-347). For each gene, every potential 60-nucleotide sequence was
examined and the 60-mer best satisfying the criteria was selected and printed on the
microarray.
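
By way of illustration only, the exhaustive 60-mer scan described above can be sketched as follows. The scoring function is a hypothetical stand-in: it uses only base composition (GC fraction) and distance from the 3' end, and the names pick_probe, target_gc and end_weight are assumptions introduced for this sketch rather than part of the filtering steps of Hughes et al.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G or C bases in a sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def pick_probe(transcript: str, probe_len: int = 60,
               target_gc: float = 0.5, end_weight: float = 0.001):
    """Scan every window of probe_len bases and return (start, probe).

    Illustrative score (lower is better):
        |GC fraction - target_gc| + end_weight * distance from the 3' end
    Returns None when the transcript is shorter than probe_len.
    """
    seq = transcript.upper()
    best = None
    for start in range(len(seq) - probe_len + 1):
        window = seq[start:start + probe_len]
        dist_3p = len(seq) - (start + probe_len)  # bases between probe and 3' end
        score = abs(gc_fraction(window) - target_gc) + end_weight * dist_3p
        if best is None or score < best[0]:
            best = (score, start, window)
    return None if best is None else (best[1], best[2])
```

A fuller implementation would add terms for repeat content, binding energy, sequence complexity and predicted cross-hybridization, which are omitted here.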

Array images were processed to obtain background noise, single channel intensity,
and associated measurement error estimates using the techniques referenced in Hughes et al.,
2000, Cell 102, 109-26. Expression changes between two samples were quantified as
log10 (expression ratio) where the 'expression ratio' was taken to be the ratio between
normalized, background-corrected intensity values for the two channels (red and green)
for each spot on the array. An error model for the log ratio was applied to quantify the
significance of expression changes between two samples. This error model is described
in Roberts et al., 2000, Science 287, 873-880.
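
By way of illustration only, the log10 expression ratio for a single spot can be computed as in the following sketch. It assumes the two channel intensities have already been normalized and background-corrected; the function names are hypothetical, and the error model of Roberts et al. is not reproduced.

```python
import math


def log10_ratio(red: float, green: float) -> float:
    """log10 of the red/green intensity ratio for one spot.

    Both intensities are assumed to be normalized, background-corrected
    and strictly positive.
    """
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive after background correction")
    return math.log10(red / green)


def fold_change(log_ratio: float) -> float:
    """Unsigned fold change corresponding to a log10 ratio."""
    return 10 ** abs(log_ratio)
```

On this scale a log10 ratio of 0.48 corresponds to roughly a 3-fold change, and a ratio of 1.0 to a 10-fold change.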

The expression values from these experiments were treated as quantitative traits
and carried through a linkage analysis using evenly spaced markers across the autosomal
chromosomes, to identify eQTL controlling for transcript abundances in this segregating
population (Fig. 2, step 210). For this QTL analysis, a complete linkage map 70 (Fig. 1)
for all chromosomes except the Y chromosome in mouse was constructed at an average
density of 13 cM using microsatellite markers in the manner described by Drake et al. (J.
Orthop. Res. 19, 511-517, 2001). Linkage maps were constructed and QTL analysis was
performed using MapMaker QTL (Lincoln, S.E., Daly, M.J. & Lander, E.S., Whitehead
Institute for Biomedical Research, Cambridge, MA) and QTL Cartographer (Basten,
C.A., Weir, B.S. & Zeng, Z.B., Department of Statistics, North Carolina State University,
Raleigh, North Carolina, 1999). Log of the odds ratio (lod) scores were calculated at
2-cM intervals throughout the genome for each of the 23,574 genes represented on the
mouse microarray. In addition to standard interval mapping techniques employed to
detect loci affecting the gene expression traits of interest, additional analyses were
performed to determine whether controlling for genetic background variation using
markers outside a putative region of linkage, and whether considering multiple traits
simultaneously, could increase evidence for linkage. Composite interval mapping
("CIM") techniques were employed so that markers unlinked with the test position were
considered as cofactors in the statistical model for marker-trait association. Given
multiple quantitative traits, CIM analysis can be extended to consider multiple traits
simultaneously, potentially dramatically increasing the power to detect loci affecting the
traits of interest. Joint CIM analysis was first described by Jiang and Zeng (Genetics 140,
1111-27, 1995) and is currently implemented in the QTL Cartographer software.
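
By way of illustration only, a lod score at a single marker can be sketched with ordinary least squares, with optional cofactor markers in the spirit of composite interval mapping. This is a simplified regression approximation, not the maximum-likelihood interval mapping of MapMaker QTL or QTL Cartographer; it tests an additive effect at a genotyped marker only, and the names lod_score and cofactors are assumptions of this sketch. The lod is computed as (n/2) * log10(RSS_null / RSS_full).

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _residualize(vec, basis):
    """Remove from vec its projection onto each mutually orthogonal basis vector."""
    out = list(vec)
    for b in basis:
        bb = _dot(b, b)
        if bb > 1e-12:  # skip degenerate (collinear) basis vectors
            coef = _dot(out, b) / bb
            out = [o - coef * x for o, x in zip(out, b)]
    return out


def lod_score(trait, genotype, cofactors=()):
    """Regression lod for an additive marker effect.

    trait:     expression values, one per animal
    genotype:  marker genotype codes (for example -1, 0, +1), one per animal
    cofactors: genotype vectors of background markers included in both the
               null and the full model
    """
    n = len(trait)
    basis = [[1.0] * n]                      # intercept
    for c in cofactors:                      # Gram-Schmidt orthogonalization
        basis.append(_residualize(c, basis))
    y = _residualize(trait, basis)           # residuals of the null model
    g = _residualize(genotype, basis)
    rss_null = _dot(y, y)
    gg = _dot(g, g)
    if rss_null <= 0.0 or gg <= 1e-12:
        return 0.0                           # nothing left to explain or to test
    rss_full = rss_null - _dot(y, g) ** 2 / gg
    if rss_full <= 0.0:
        return math.inf                      # marker fits the trait perfectly
    return (n / 2.0) * math.log10(rss_null / rss_full)
```

Repeating this calculation at each tested position, for each expression trait in turn, yields a lod curve per gene.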

Of the 23,574 genes represented on the microarray, 7,861 were detected as
significantly differentially expressed (Type I error = 0.05) in the parental strains or in at
least ten percent of the F2 mice profiled. That is, the expression values for the candidate
gene varied across the mouse population. Such behavior is in contrast to the case where a
gene is not significantly differentially expressed across a mouse population because, for
example, it is always expressed at the same level or is rarely expressed at all. In this
experiment, genes that are differentially expressed are of interest for use in constructing
expression statistic sets 304 (e.g., Figs. 3A and 3B).

Each of the 7,861 genes that exhibited differential expression was used to
construct a respective expression statistic set 304 (e.g., Figs. 3A and 3B). That is, each
set 304 corresponded to the expression value for one of the 7,861 differentially expressed
genes from each of the 111 F2 mice. Each set 304 therefore included 111 expression
statistics 308 (Figs. 3A and 3B) and each of these expression statistics 308 represented
the expression value for the same gene from each of the 111 mice. These expression
statistics sets 304 as well as a mouse genetic marker map 78 (Fig. 1) were used as input to
standard QTL analysis software (Fig. 2, steps 208 and 210). Using such standard QTL
analysis techniques, eQTL with a lod score greater than 4.3 (P-value < 0.00005) were
identified for 2,123 genes. The lod scores over this set ranged from 4.3 to 80.0 (p value
« 10'^^), among the highest lod scores ever reported for a quantitative trait. On average,
eQTL with lod scores greater than 4.3 explained twenty-five percent of the transcription
variation of the 7,861 corresponding genes observed in the F2 set, with this percentage
increasing to nearly 50% for lod scores greater than 7. For any given position, it is
expected that there are no false positive eQTL over the 7,861 differentially expressed genes tested.
If the multiple positions tested for each gene are taken into account, it is expected that there are only
393 false positives at a lod score threshold of 4.3.
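
By way of illustration only, the two expected false positive counts can be reproduced with simple arithmetic. The pointwise P-value of 0.00005 is taken from the text; the genome-wide Type I error of 0.05 at a lod score of 4.3 is an assumption of this sketch that is not stated explicitly above, although it is consistent with the figure of 393.

```python
N_GENES = 7861            # differentially expressed genes tested
P_POINTWISE = 0.00005     # P-value at lod 4.3 for a single position (from the text)
P_GENOME_WIDE = 0.05      # ASSUMED genome-wide Type I error at lod 4.3

# Expected false positives when a single position is tested per gene
expected_single_position = N_GENES * P_POINTWISE

# Expected false positives when all positions tested per gene are considered
expected_genome_wide = N_GENES * P_GENOME_WIDE

print(f"single position: {expected_single_position:.2f}")
print(f"genome wide:     {expected_genome_wide:.0f}")
```

Under these assumptions fewer than one false positive is expected at any single position, and about 393 are expected across the genome.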

In processing all genes with standard interval mapping techniques (without
filtering on significant differential expression over the set of mice profiled), 4,339 eQTL
over 3,701 genes were detected with lod scores greater than 4.3. When the lod score
threshold was dropped to 3.0, 11,021 genes gave rise to at least one eQTL, with a total of
17,415 eQTL over this set of genes. The number of eQTL with lod scores exceeding 7.0
(p-value =10" ) jumped by 50% when genes that were not detected as significantly
differentially regulated in ten or more mice were considered (no additional genes at this
threshold would be expected by chance). This indicates that, while individual tests of
hypotheses on the differential regulation of a single gene may not be significant, viewing
the behavior of that gene by genotype over 111 animals provides sufficiently more
information on the biological activity of that gene. Of the 965 genes with lod scores
greater than 7.0, 157 had a maximum log ratio separation among any two mice of less
than 0.48 (less than 3.0 fold change), indicating a class of genes whose high lod scores
reflect tight transcriptional control (small variance), not large expression differences.
Additionally, 153 genes from this same set of 965 were expressed in mice homozygous
for one of the parental strains at the gene's location, but not detectably expressed in mice
homozygous for the other parental strain.

Fig. 6B plots the percentage of eQTL at different lod score thresholds across 920
evenly-spaced bins, each 2 cM wide, covering the mouse genome. The number of eQTL
in each bin was divided by the total number of eQTL plotted. eQTL hot spots are
apparent on chromosomes 2, 6, 7, 10, 11 and 17, where for each of these hot spot
locations, greater than one percent of the total number of eQTL identified genome wide
localize to a 4 cM window. The highly non-uniform nature of this eQTL distribution over
the chromosomes is not likely to have happened by chance. In fact, with 460 4 cM
windows over the 19 autosomal chromosomes, the probability that greater than one
percent of the eQTL would localize to one such window is less than 1.2 x 10'^^. These
eQTL hot spots could represent loci driving key biological processes critical to the system
under study, as will be discussed below. At a lod score of 4.3, over eighty percent of the
genes have only a single eQTL, with only 10% of the genes having more than two
detected eQTL. The view at a lower lod score threshold represents a slightly more
complex picture, given the appearance of many more genes under the control of multiple
loci, with greater than 40% of the genes having more than one eQTL and close to 4% of
the genes having more than 3 detected eQTL. While a 3.0 lod score does not meet
genome-wide significance criteria (Lander & Kruglyak, 1995, Nat. Genet. 11, 241-7) in a
single trait setting, and while this significance is even more questionable in a
multiple-testing setting where a large number of traits is considered, the pattern of eQTL clustering
to specific loci and the relationship between these genes with respect to expression, when
taken together, lead to highly significant and interesting patterns that can be associated
with phenotypes related to common diseases.
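
By way of illustration only, hot spot detection of the kind described above can be sketched by counting eQTL in fixed 4 cM windows and reporting windows that hold more than one percent of all eQTL. The input format, a list of (chromosome, position in cM) pairs, and the function name hot_spots are assumptions of this sketch; windows here are non-overlapping and aligned to 0 cM on each chromosome.

```python
from collections import Counter


def hot_spots(eqtl_positions, window_cm=4.0, min_fraction=0.01):
    """Return windows holding more than min_fraction of all eQTL.

    eqtl_positions: iterable of (chromosome, position_cM) pairs
    Returns a sorted list of (chromosome, window_start_cM, fraction).
    """
    counts = Counter(
        (chrom, int(pos // window_cm)) for chrom, pos in eqtl_positions
    )
    total = sum(counts.values())
    return sorted(
        (chrom, idx * window_cm, n / total)
        for (chrom, idx), n in counts.items()
        if n / total > min_fraction
    )
```

For example, if 31 of 486 eQTL fall in the window from 8 to 12 cM on chromosome 2 and the rest are spread one per window, only that window is reported.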

Of the 23,574 genes represented on the mouse array, 18,460 could be reliably
mapped to a unique autosomal chromosome location using the Celera Mouse Genome
database. Of these 18,460 mapped genes, 3,007 had eQTL with lod scores greater than
4.3, and 784 had eQTL with lod scores greater than 7.0. Approximately 34% of the
mapped genes with eQTL exceeding 4.3 had a physical location coincident with the
eQTL position, while 71% of the mapped genes with eQTL exceeding 7.0 had a physical
location coincident with their eQTL position. Due to the unreliable nature of QTL
positioning in the type of experimental cross used in this experiment, an eQTL and gene
were defined as coincident when the physical location of the gene mapped to within
15 cM of its eQTL. By chance, it would be expected that the physical location of genes
would coincide with their eQTL positions fewer than 2% of the time. Keeping in mind
that the number of mice considered in the QTL analysis is relatively small, leading to
reduced power in detecting moderate to small QTL effects, the trend observed here is that
eQTL with high lod scores are cis acting in most cases, while moderately significant QTL
are trans acting in most cases. This is consistent with the expectation that first order
effects (DNA variations in a gene that affect transcription of the gene itself) are easier to
detect than second order effects (genes acting on other genes to affect transcription).
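
By way of illustration only, the coincidence rule described above (same chromosome and within 15 cM) can be sketched as follows. The sketch assumes the physical gene location has already been expressed on the genetic map in cM; the function name and the example coordinates are hypothetical.

```python
def classify_eqtl(gene_chrom, gene_cm, eqtl_chrom, eqtl_cm, max_dist_cm=15.0):
    """Label an eQTL as 'cis' when it is coincident with its gene, else 'trans'.

    Coincident means same chromosome and within max_dist_cm of the gene.
    """
    same_chrom = gene_chrom == eqtl_chrom
    if same_chrom and abs(gene_cm - eqtl_cm) <= max_dist_cm:
        return "cis"
    return "trans"


# Hypothetical examples
print(classify_eqtl(15, 30.0, 15, 26.3))   # same chromosome, 3.7 cM apart
print(classify_eqtl(9, 29.0, 2, 29.0))     # different chromosomes
```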

There are many possible explanations for significant eQTL identified for
transcript abundance measurements. While the genetic regulation of transcription
explains only a percentage of protein diversity, the extent of biologically meaningful
polymorphisms that can be detected in this setting is surprising. In addition, additive and
dominance effects in genes whose transcription is polymorphic can be teased apart in
experimental crosses such as the one described in this example.

Fig. 7 illustrates a plot of the mean log10 expression ratios for the Apo-A1 gene
(lower panel) and a VCP-like ATPase gene (upper panel) by genotype at markers
D9Mit19 (lod score equal to 32.5) and D2Mit50 (lod score equal to 54.3), respectively.
Both the Apo-A1 gene and the VCP-like ATPase gene have lod scores exceeding 30.0.
The highly significant eQTL are explained by the significant separation of the expression
ratios between the genotypes and the tight variance within each genotype group. The
eQTL effect at the VCP-like ATPase gene is mostly additive, given that the differences in
expression between the heterozygotes ("0") and DBA homozygotes ("-1"), and between
the heterozygotes ("0") and B6 homozygotes ("+1"), are roughly equal. The eQTL effect
at the Apo-A1 locus has a large dominance component, evidenced by the large expression
separation between the DBA homozygotes ("-1") and the heterozygotes ("0"), and the
small separation between the B6 homozygotes ("+1") and the heterozygotes ("0"). In
summary, the eQTL for Apo-A1 demonstrates strong dominance and the QTL for the
VCP-like ATPase demonstrates simple additive effects. Overall, for the 4,339 QTL with
lod scores greater than 4.3, roughly 20% demonstrated a significant dominance effect (lod
associated with dominance effect greater than 3.0).
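
By way of illustration only, additive and dominance effects can be estimated from the genotype group means using the standard quantitative genetics definitions: the additive effect is half the difference between the two homozygote means, and the dominance effect is the deviation of the heterozygote mean from the homozygote midpoint. The genotype codes follow the text (-1 for DBA homozygotes, 0 for heterozygotes, +1 for B6 homozygotes); the function name and example values are hypothetical.

```python
from statistics import mean


def additive_dominance(values_by_genotype):
    """Return (additive, dominance) effects from per-genotype expression values.

    values_by_genotype: dict with keys -1, 0, +1 mapping to non-empty lists
    of mean log10 expression ratios.
    """
    m_dba = mean(values_by_genotype[-1])
    m_het = mean(values_by_genotype[0])
    m_b6 = mean(values_by_genotype[1])
    additive = (m_b6 - m_dba) / 2.0
    dominance = m_het - (m_b6 + m_dba) / 2.0
    return additive, dominance


# Hypothetical data: heterozygotes sit close to the B6 homozygotes,
# so the dominance effect is large relative to the additive effect.
a, d = additive_dominance({-1: [-0.6], 0: [0.3], 1: [0.4]})
print(a, d)
```

A dominance effect near zero indicates a mostly additive locus; a dominance effect comparable in size to the additive effect indicates that one allele largely masks the other.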

Fig. 8 highlights a range of gene-centered polymorphisms known to exist between
DBA and B6 mouse strains. In each of the examples highlighted, the loci identified by
linkage to the transcript abundances of the genes listed were coincident with the physical
location of the gene itself. Single nucleotide polymorphisms covered by 60-mer
oligonucleotide probes would not be expected to significantly affect transcript abundance
measurements among the samples (See Hughes et al., 2001, Nat Biotechnol 19, 342-347),
but polymorphisms that lead to changes in transcript half-life, that directly enhance
promoter and transcription factor binding sites, or more significant polymorphisms, such
as insertions and deletions that could arise by alternative splicing, all provide signatures
that are readily detectable by the examination of expression levels in a segregating
population.

In particular, Fig. 8 illustrates examples of four types of transcript abundance
polymorphisms (differential transcript decay, differential dosing, differential splicing, and
differential transcription rate) readily detected by eQTL analysis. More details on these
observations are provided in Section 6.5 below. The mouse C5 gene has a two base pair
deletion in a 5' exon in the DBA strain, which causes a more rapid decay of the transcript
in DBA compared to the B6 mouse strain. See, for example, Karp et al., 2000,
"Identification of complement factor 5 as a susceptibility locus for experimental allergic
asthma," Nat Immunol. 1, 221-226. A lod score of 27.4 centered over the C5 gene on
chromosome 2 is readily detected (curve 802). The ALAD gene is present in two copies
in the DBA strain and only one copy in the B6 strain. See, for example, Claudio et al.,
1997, "A murine model genetic susceptibility to lead bioaccumulation," Fundam Appl
Toxicol 35, 84-90. The major QTL (lod score of 9.3) for ALAD transcript abundances is
centered over the ALAD gene (curve 804) and represents the differential dosing that
occurs between the two strains, due to the different copy numbers. The ST7 gene is
differentially spliced at several locations (See Huang et al., 2002, Nucleic Acids Res 30,
186-190), and for a stable splice form at the 3' location of the gene, the probe for this
gene fortuitously overlapped the region alternatively spliced out in DBA, but not B6. The
differential splicing event is detected by the major QTL (lod score of 20.1) for ST7,
which is centered over the ST7 gene (curve 806). Finally, the NNMT gene, important for
drug metabolism, is known to be polymorphic with respect to transcription between the
DBA and B6 strains. See, for example, Huang et al., 2002, "Putative Alternative Splicing
database," Nucleic Acids Research 30, 186-190. This polymorphism is confirmed by a
major QTL (lod score of 15.3) for the NNMT gene, centered over the NNMT gene (curve
808).

Identification of cis-acting transcriptional control can serve as a filter for
associating polymorphisms in DNA sequence with polymorphisms in transcription. For
instance, while the DNA variations noted in Fig. 8 lead to transcriptional polymorphisms,
the insulin-like growth factor binding protein complex acid labile chain (Igfals) has five
SNPs identified between the B6 and DBA strains, two of which are mis-sense mutations:
1) codon 165 is arginine in DBA and glutamine in B6 and 2) codon 69 is glycine in DBA
and serine in B6. Igfals is significantly differentially expressed in 18 of the 111 samples,
and has two suggestive linkages on chromosomes 11 (lod = 2.72) and 18 (lod = 2.5), but
is physically mapped to chromosome 17, where no linkage is detected. One can conclude
from this that the polymorphisms in the sequence of this gene do not give rise to variation
in its transcript levels, unlike those cases highlighted in Fig. 8.

6.5. TYPES OF POLYMORPHISMS THAT CAN BE DETECTED USING
EXPRESSION QTL ANALYSIS

Some embodiments of the QTL analysis performed in step 210 (Fig. 2) or step
1910 (Fig. 9) are limited in the sense that the transcription must be polymorphic in the
population under study in order for QTL for that transcription to be detected. However,
the types of DNA polymorphisms that lead to transcription polymorphisms are extensive,
and this example illustrates how QTL analysis on gene expression data is capable of
detecting many of these polymorphisms. This example specifically includes (1)
identifying QTL for genes that have a higher copy number in one parent than the other, (2)
identifying QTL associated with differential splicing between two strains, (3) identifying
QTL associated with a differentially expressed gene between two strains where
polymorphisms in the promoter/regulatory regions of the gene explain the differential
expression, and (4) identifying QTL for genes that have a nonsense mutation in one
parent but not the other. It will be appreciated that, in some embodiments, protein levels
are used as quantitative traits in step 210 (Fig. 2) or step 1910 (Fig. 9) rather than
transcription levels.

Referring to Fig. 9, the ALAD gene is present in two copies in DBA/2J and a
single copy in C57BL/6J, and the gene is known to be expressed in liver. In the F2
generation there are three possible genotypes at the ALAD locus leading to different
ALAD copy numbers: 1) homozygous for DBA, giving four total copies of the ALAD
gene, 2) heterozygous, giving three total copies of the ALAD gene, and 3) homozygous
for C57BL/6J, giving two total copies of the ALAD gene. As illustrated in Fig. 9, the
differential expression due to the three different doses is detected in the F2 data. First, the
gene is identified as differentially expressed between the parent and F2 strains. Second, a
high lod score for ALAD expression that is coincident with the gene's physical location is
found using processing steps 202 through 210 of Fig. 2. In particular, an expression
statistic set 304 for the ALAD expression level is used as the quantitative trait in a QTL
analysis that uses the mouse strains as well as the phenotype data from the DBA/2J, C57BL/6J
cross.
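
By way of illustration only, the three ALAD doses can be tabulated as in the following sketch. The copy numbers per parental haplotype are taken from the text. The expected log10 ratio is computed under two assumptions introduced here and not stated in the text: that transcript abundance is proportional to copy number, and that the reference sample carries an average of three copies.

```python
import math

# ALAD copies carried on one chromosome from each parental strain (from the text)
COPIES_PER_HAPLOTYPE = {"DBA": 2, "B6": 1}


def alad_copy_number(allele_a: str, allele_b: str) -> int:
    """Total ALAD copies for an F2 animal given its two parental alleles."""
    return COPIES_PER_HAPLOTYPE[allele_a] + COPIES_PER_HAPLOTYPE[allele_b]


def expected_log10_ratio(copies: int, reference_copies: float = 3.0) -> float:
    """Expected log10 expression ratio, ASSUMING expression scales with dose."""
    return math.log10(copies / reference_copies)


for alleles in (("DBA", "DBA"), ("DBA", "B6"), ("B6", "B6")):
    n = alad_copy_number(*alleles)
    print(alleles, n, round(expected_log10_ratio(n), 3))
```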


Referring to Fig. 10, the Putative Alternative Splicing DB (PALS DB) identifies murine
genes that are predicted to be alternatively spliced with very high confidence. Approximately
200 genes had a significant lod score (lod > 5.0) in the mouse data set described in
Example 6.4 above (liver tissue from 111 F2 mice constructed from two standard inbred
strains of mice, C57BL/6J and DBA/2J). Probe sequences used on the arrays for each of
the 200 genes were mapped to the sequences for those genes. The probes that overlapped
the predicted splice sites were identified. Of the 200 genes with significant lod scores,
five had predicted splice sites that overlapped probe sequences. Fig. 10 shows one of
these examples. The ST7 gene has a stable splice form in DBA that has an approximate
30 base pair stretch deleted, compared to B6. The lod score curve plot in Fig. 10
demonstrates how the QTL analysis picks up this differential splicing event, since not
only is the gene detected to be significantly differentially expressed in the F2 and between
the parental strains, but this differential expression leads to a very significant QTL for the
ST7 gene that is coincident with the physical location of the ST7 gene. Note that the lod
score plot covers the entire genome in this case. In addition, there is a minor QTL on one
of the chromosomes that happens to coincide with an enhancer binding protein that is
known to be involved in differential splicing. So, not only can splicing events be
detected, but the genetic determinants behind the alternative splicing can begin to be
understood.
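
By way of illustration only, the step of identifying probes that overlap predicted splice regions can be sketched as a simple interval overlap test. The data layout (probe and splice regions as half-open (start, end) coordinates on the same gene sequence) and all names are assumptions of this sketch, not the format of PALS DB.

```python
def overlaps(probe, region):
    """True when two half-open intervals (start, end) share at least one base."""
    return probe[0] < region[1] and region[0] < probe[1]


def genes_with_probe_over_splice(probes, splice_regions):
    """Return the sorted gene names whose probe overlaps a predicted splice region.

    probes:         dict of gene -> (start, end) of the probe on the gene sequence
    splice_regions: dict of gene -> list of (start, end) predicted spliced-out regions
    """
    return sorted(
        gene
        for gene, probe in probes.items()
        if any(overlaps(probe, region) for region in splice_regions.get(gene, []))
    )


# Hypothetical coordinates for two genes
probes = {"geneA": (100, 160), "geneB": (500, 560)}
splice_regions = {"geneA": [(150, 180)], "geneB": [(560, 590)]}
print(genes_with_probe_over_splice(probes, splice_regions))
```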

Referring to Figs. 11 and 12, the nicotinamide N-methyltransferase gene codes for
an enzyme that is critical to drug metabolism. Others have shown that polymorphisms in the
promoter for this gene are responsible for its differential expression between the DBA
and B6 mouse strains. Table 4 demonstrates that this differential expression is detected
since the expression levels of this gene give rise to a QTL with a lod score of 20.1 that is
coincident with the physical location of the gene.

TABLE 4 



Gene Name    Physical Gene Location      QTL Locations              QTL Peak lod
             (Chromosome / Location)     (Chromosome / Location)    Scores



nicotinamide 
nucleotide
transhydrogenase 

9530010C24Rik 

ectonucleotide 
pyrophosphatase 

EST AW456442

5' nucleotidase 



13 / 64.0 cM 



13 / 107 cM



8.7 



Unknown 
15 / 30.0 cM 

11 
9 



EST AW540195 

purine-nucleoside 
phosphorylase 

N-terminal Asn 

amidase 

nicotinamide N- 
methyltransferase 



5 / 25.0 cM 
14/ 19.5 cM 

16 / 8.7 cM 



9 / 29.0 cM 



aldehyde oxidase 1    1 / 23.2 cM



6 / 39.5 cM 
15 / 26.3 cM 

not available 
6 / 39.5 

9 / 10.0 
not available 
9/1.0 cM 

2 / 79.9 cM 

14 / 22.0 cM 
9 / 5.0 cM 

13 / 88.0 cM 
16 / 1.0 cM



2.2 
10.3 

not available 
2.5 

3.1 

not available 


2.4 

2.2 

3.9 
20.1 

2.6 
2.1 



The pathways associated with nicotinate and nicotinamide metabolism are fairly
well known. Fig. 11 illustrates these different pathways. Fig. 12 provides a key for the
important genes that are found in the pathways illustrated in Fig. 11. Table 4 gives the
physical location for these key genes in addition to any QTL for those genes represented
on the mouse array that were detected using the expression values of those genes in QTL
analysis (Fig. 2, steps 202 through 210). Table 4 shows that several of the genes involved
in this pathway have QTL co-localized with the major chromosome 9 nicotinamide
N-methyltransferase QTL. In addition, several of the other genes in this pathway are
polymorphic with respect to expression (nicotinamide nucleotide transhydrogenase and
ectonucleotide pyrophosphatase), with QTL coincident with the physical gene location.
Further, several of the other genes in this pathway have QTL co-localizing with these
major QTL. The results summarized in Table 4 show that the cross talk going on
between genes in the same biochemical pathway is detectable using the combination of
genetics and gene expression.

None of the genes described in Table 4 colocalize as clusters in a gene expression
cluster map (Fig. 2, step 216). Thus, analysis of a gene expression map would not have
tied these genes together. Rather, the relationships were discovered by treating the
expression level of each respective gene in a plurality of organisms as a quantitative trait
in a QTL analysis regimen (Fig. 2, steps 202 through 210).

Referring to Fig. 13, the complement component 5 gene (C5) has a two base pair
deletion in exon 6 in the DBA strain, but not in the B6 strain. Others have associated C5
in these two strains with complex diseases, such as asthma and arthritis. The gene is
detected as differentially expressed between the two strains because the two base pair
deletion in DBA leads to a premature stop codon, which causes the transcripts to be
degraded more rapidly. The lod score plot in Fig. 13 covers the genetic signal for the C5
gene over the entire mouse genome. From Fig. 13, it is seen that the only significant spike
occurs at the chromosome 2 position where the C5 gene physically resides. The lod score
in this case is 28, which means that more than 90% of the variation in the C5 gene in this
F2 population is explained by the two base pair deletion.



6.6. COLOCALIZATION OF eQTL FOR LIPID METABOLISM GENES
REVEALS A QTL HOT SPOT THAT IS A POSSIBLE CAUSATIVE AGENT FOR
THE eQTL

In this example, mice from a C57BL/6J x DBA/2J cross were placed on a chow-fed
diet through four months of age, and at four months various phenotypic measurements
were taken and the mice were then placed on a high-fat diet. At six months of age, the
mice were sacrificed and scored with respect to over sixty traits, such as adiposity,
retroperitoneal fat pad, body weight, fat pad mass, omental fat pad, perimetrial fat pad,
subcutaneous fat pad, and total cholesterol. Each of these phenotypic traits may be used
to identify linking QTL using standard QTL analysis. Fig. 14 illustrates the results of one
such QTL analysis in a region of mouse chromosome 11 for the phenotypic traits "free
fatty acid" (curve 1402) and "triglyceride level" (curve 1404). Curve 1406 is the joint lod
score curve. Expression QTL ("eQTL") (not shown in Fig. 14) from approximately 40
genes known to be involved with glucose and lipid metabolism overlap the "free fatty
acid" and "triglyceride level" clinical trait QTL ("cQTL"). Fig. 15 highlights five of
these genes. Each of these five genes has an eQTL that co-localizes with the "fatty acid"
and "triglyceride" cQTL.

One of the genes illustrated in Fig. 15, the peroxisome proliferator activated
receptor (PPAR) binding protein, has a very large QTL at this chromosome 11 locus
(curve 1502). The PPAR binding protein is known to be a key co-activator for PPAR
alpha, which also links to this chromosome 11 locus. Fig. 16 shows a scatter plot that
breaks down the mean log ratios for the PPAR binding protein by genotype at the
chromosome 11 location across the F2 mouse population (120 F2 mouse livers) that was
profiled. Of note in Fig. 16 is the subtle, but consistent expression among the genotypes
that would have been completely missed if only the differential expression had been
analyzed (i.e., without the use of quantitative expression QTL analysis) because the fold
changes range only from -1.5 to 1.5. However, with the genetics, a very strong
signal is measured due to the tightness with which expression groups by genotype. Fig.
17 illustrates what the plot illustrated in Fig. 16 would look like in the random case. Fig.
17 illustrates the expression of PPAR alpha by genotype at the chromosome 15 location
where the PPAR alpha gene physically resides. As can be seen in Fig. 17, the expression
of PPAR alpha is almost completely random with respect to genotype, although a wider
range of expression for the B6 genotype is observed. This may be of some interest
because changes in variation are potentially as interesting as changes in mean.
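
By way of illustration only, the contrast between tight grouping by genotype and the random case can be quantified as the fraction of expression variance explained by genotype (between-group sum of squares divided by total sum of squares). This summary statistic is an assumption of the sketch and is not the lod calculation used in the example; the function name and data values are hypothetical, and every genotype group is assumed to be non-empty.

```python
from statistics import mean


def fraction_explained(values_by_genotype):
    """Fraction of total variance in expression explained by genotype group."""
    all_values = [v for vals in values_by_genotype.values() for v in vals]
    grand_mean = mean(all_values)
    total_ss = sum((v - grand_mean) ** 2 for v in all_values)
    between_ss = sum(
        len(vals) * (mean(vals) - grand_mean) ** 2
        for vals in values_by_genotype.values()
    )
    return between_ss / total_ss if total_ss > 0 else 0.0


# Small fold changes, but expression groups tightly by genotype
tight = {-1: [-0.10, -0.11, -0.09], 0: [0.00, 0.01, -0.01], 1: [0.10, 0.11, 0.09]}
# Same overall spread, but no relationship to genotype
unrelated = {-1: [-0.1, 0.0, 0.1], 0: [-0.1, 0.0, 0.1], 1: [-0.1, 0.0, 0.1]}

print(fraction_explained(tight))
print(fraction_explained(unrelated))
```

In the first data set nearly all of the variance is attributable to genotype even though no animal shows a large fold change, which is the situation described for the PPAR binding protein.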

Fig. 18 illustrates how genes known to be involved in lipid metabolism link to the
same genetic locus, even though they physically reside at different locations. In Fig. 18,
the chromosomal positions of the genes Cyp2a-12, peroxisome proliferator activated
receptor binding protein (PPARBP), Atf4, PPARα, and Abcg5 are shown on mouse
genome map 1802. Further, the positions of eQTL that correspond to these genes are
shown on mouse genome map 1804. Specifically, the eQTL that arise when each of the

genes mapped to genome map 1802 is treated as a quantitative trait in a QTL analysis are
shown mapped to mouse genome map 1804 of Fig. 18. The gene PPARBP physically
resides at an eQTL hot spot positioned on chromosome 11 of genome map 1804. The
correspondence of the physical location of PPARBP with this eQTL hot spot implicates
this gene as the causative agent for the eQTL at the hot spot. Thus, the data shown in Fig.
18 suggest that PPARBP is in a biological pathway at a point that is upstream from the
genes Cyp2a-12, Atf4, PPARα, and Abcg5.



6.7. ELUCIDATING GENES AND PATHWAYS FOR COMPLEX TRAITS 

Associating patterns of expression with a clinical trait and dissecting those
patterns by associating them with susceptibility loci represents a potentially powerful
way to dissect complex diseases. The present example provides a method for associating
a gene with a clinical trait T. In some embodiments, clinical trait T is a complex trait
(e.g., a complex disease). Section 5.15 describes the characteristics of some complex traits
within the scope of the present invention. The method works by interfacing gene
expression data with clinical trait data in order to identify potential causative genes for a
trait and the associated pattern of response. The steps used in the method are illustrated
in Fig. 19 and described in Section 5.16, above.

Major loci controlling complex phenotypes like obesity, heart disease and CNS
disorders may potentially affect scores of genes, if not hundreds. It is expected that those
genes involved in the more downstream aspects of pathways associated with common
diseases would have eQTL linked to the major causative loci for those diseases. In
addition, there may be heterogeneity among the causative loci for a given disease in a
population of interest. When present, this heterogeneity impacts the ability to detect
linkages to the causative loci, since the significance of any one locus is diminished when
the population is considered as a whole in such a setting. Therefore, the development of
techniques that allow for the identification of homogeneous subpopulations with respect to
causative loci, as provided in the present application, is a major advance in the elucidation
and dissection of the genetic basis for complex diseases.


6.7.1. CASE STUDY USING MOUSE DATA 
The steps outlined in Fig. 19 were performed using the mouse system described in
Section 6.4. Livers were profiled in mice after the mice had been on a high-fat,




atherogenic diet for four months. As described by Drake et al. (J. Orthop. Res. 19, 511-7,
2001; Physiol. Genomics 5, 205-15, 2001), such mice represent the spectrum of disease in
a natural population, with many mice developing atherosclerotic lesions and brain lesions,
and others having significantly higher fat-pad masses, higher cholesterol levels and larger
bone structures than others in the same population. The primary motivations for the
beginning steps described in Fig. 19 are using the expression data 44 (Fig. 1) to identify
patterns that refine the definition of a clinical trait, including identifying subtypes of the
clinical trait, then identifying QTL for these clinical trait subtypes (cQTL), and linking
this information with the gene expression traits to elucidate genes and pathways
associated with the clinical traits. Associating patterns of expression with a clinical
trait and dissecting those patterns by associating them with susceptibility loci represents a
powerful way to dissect complex diseases.

More than one percent of the eQTL identified genome-wide for the 7,861 genes G
that were used in respective QTL analyses (e.g., instances of processing step 1910, Fig.
19) fall within a 10 cM window centered at approximately 100 cM on chromosome 2 in
the mouse genome (Fig. 20). There are 867 genes with lod scores over 2.0 linked to this
region. The majority of genes linked to the chromosome 2 locus do not physically reside
on chromosome 2, and so are at least partially regulated by one or more loci in the
chromosome 2 hot-spot region.
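The hot-spot observation above amounts to counting how many eQTL fall inside a fixed window on one chromosome. A minimal sketch follows; the tuple layout, the function name `fraction_in_window`, and every record are hypothetical and are not data from the study.

```python
def fraction_in_window(eqtl, chrom, center_cm, width_cm, min_lod=2.0):
    """Fraction of eQTL with lod above min_lod that fall in a window on one chromosome.

    eqtl: list of (gene, chromosome, position_cM, lod) tuples.
    """
    kept = [e for e in eqtl if e[3] > min_lod]
    half = width_cm / 2.0
    hits = [e for e in kept if e[1] == chrom and abs(e[2] - center_cm) <= half]
    return len(hits) / len(kept) if kept else 0.0

# Hypothetical records, not data from the study.
eqtl = [
    ("geneA", 2, 103.0, 4.1),
    ("geneB", 2, 98.5, 2.6),
    ("geneC", 7, 40.0, 3.3),
    ("geneD", 2, 120.0, 5.0),   # same chromosome, outside the window
    ("geneE", 11, 55.0, 1.2),   # dropped: lod not above the threshold
]
share = fraction_in_window(eqtl, chrom=2, center_cm=100.0, width_cm=10.0)
```

Here two of the four retained eQTL fall in the 10 cM window centered at 100 cM, so `share` is 0.5.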

Co-localized with this locus are many cQTL (determined by instances of
processing step 1912, Fig. 19) for clinical traits T such as adiposity, fat pad mass, plasma
lipid levels and bone density. Fig. 20 shows the lod score curves for four of the
obesity-related traits, the peaks of which are almost perfectly coincident with the
hundreds of eQTL falling at that locus. The four obesity-related traits are (1)
subcutaneous fat pad mass (curve 2002 peaking at 105 cM with a lod score = 6.25), (2)
perimetrial fat pad mass (curve 2004 peaking at 103 cM with a lod score = 5.31), (3)
omental fat pad mass (curve 2006 peaking at 103 cM with a lod score = 3.80), and (4)
adiposity (curve 2008 peaking at 105 cM with a lod score = 3.69). The joint lod score
curve for these four clinical traits is given by line 2010, peaking at 105 cM with a lod score
= 13.02. The majority of genes linked to this region do not physically reside on
chromosome 2, and so are at least partially regulated by one or more loci in the
chromosome 2 hot-spot region. For the 423 genes with mapping information, there are
only four eQTL with lod scores greater than 3.0 that correspond to genes whose physical
locations are within 2 cM of the peak (1916-Yes, 1920, Fig. 19). The lod score curves for


these four potential candidate genes that may explain the chromosome 2 eQTL hot spot
are represented by lines 2012 in Fig. 20. From the highest lod score to the lowest, the four
candidate genes are (1) RIKEN cDNA 2610042014 (NM_025575) peaking at 103 cM
with a lod score = 24.43 (curve 2012-4), (2) ATPase, class II, type 9A (NM_015731)
peaking at 105 cM with a lod score = 6.13 (curve 2012-3), (3) RIKEN cDNA 2610100K07
(NM_025996) peaking at 101 cM with a lod score = 5.04 (curve 2012-2), and (4) zinc
finger protein 64 (NM_009564) peaking at 101 cM with a lod score = 3.56 (curve 2012-1).
Gene NM_025575 codes for a dolichyl-diphosphooligosaccharide-protein
glycosyltransferase and gene NM_015731 codes for a cation-transporting ATPase; these
genes may be considered the primary causative candidates for the linkage activity at the
chromosome 2 locus.
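The candidate selection described above (eQTL with lod scores greater than 3.0 whose genes physically reside within 2 cM of the peak) can be sketched as a simple filter. The `EQTL` record, the function `cis_candidates`, and the example genes and positions are all hypothetical; physical gene positions are not given in this section.

```python
from dataclasses import dataclass

@dataclass
class EQTL:
    gene: str
    lod: float
    gene_chrom: int   # chromosome on which the gene physically resides
    gene_cm: float    # physical location of the gene, expressed in cM

def cis_candidates(eqtls, chrom, peak_cm, min_lod=3.0, max_dist_cm=2.0):
    """Keep eQTL whose gene lies near the hot-spot peak, sorted by descending lod."""
    hits = [
        e for e in eqtls
        if e.lod > min_lod
        and e.gene_chrom == chrom
        and abs(e.gene_cm - peak_cm) <= max_dist_cm
    ]
    return sorted(hits, key=lambda e: e.lod, reverse=True)

# Hypothetical eQTL that all link to the chromosome 2 hot spot.
linked = [
    EQTL("gene_w", 24.4, 2, 104.0),
    EQTL("gene_x", 6.1, 2, 103.5),
    EQTL("gene_y", 8.0, 9, 30.0),    # gene resides on another chromosome
    EQTL("gene_z", 2.5, 2, 103.0),   # lod not above 3.0
]
candidates = cis_candidates(linked, chrom=2, peak_cm=104.0)
```

Only `gene_w` and `gene_x` survive the filter in this made-up input, mirroring how a handful of genes is retained from the hundreds linked to the locus.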

The class of genes represented in Fig. 20 (curves 2012), identified by intersecting
cQTL data with eQTL data in accordance with Fig. 19, provides convincing evidence that
many of the genes co-localized to a single QTL hot spot are associated with the
obesity-related traits. Hence, several candidate genes whose physical locations are
coincident with their respective eQTL are reasonable candidate genes for further research.
It may be that the causative gene is not differentially regulated and so is not detectable
with the methods described in this example. However, when these inventive methods are
viewed from the standpoint of hypothesis generation, the candidate genes with supporting
genetic clusters offer researchers valuable insight into complex traits and suggest
meaningful hypotheses for further validation. In this example, the combined gene
expression/genetics approach has effectively generated interesting hypotheses by filtering
the number of genes that would otherwise need to be considered from 25,000 to three or
four reasonable candidates, with hundreds of additional genes forming patterns that
represent the reactive changes induced by the causative set, all of which have been
identified in a completely objective manner.



6.7.2. HIERARCHICAL CLUSTERING 
Fig. 23 represents the results of a two-dimensional hierarchical clustering, with
123 genes along the x-axis and 36 mice along the y-axis, representing the upper and lower
25th percentile for the subcutaneous fat pad mass trait over 72 of the 111 F2 mice that
were scored with respect to this trait. Two criteria were applied in selecting the 123
genes along the x-axis: 1) genes in this set had to be significantly expressed and




differentially expressed in at least 10 mice, and 2) genes in this set had to have expression
values that were able to discriminate between the extreme subcutaneous fat pad mass
groups (using a standard two-sample t test and a significance level of 0.05). To compute
the array illustrated in Fig. 23, the log10(expression ratio) was plotted as 1) red (regions
2320) when the red channel is up-regulated relative to the green channel and 2) green
(regions 2340) when the red channel is down-regulated relative to the green channel. White
and gray areas in the array illustrated in Fig. 23 respectively represent areas in which the
log10(expression ratio) is close to zero and areas in which data from both of the channels
for a given probe is unreliable.
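The second selection criterion can be sketched as follows: take the mice in the lower and upper quartiles of the trait, then compute a pooled-variance two-sample t statistic for one gene between the two groups. The function names, mouse identifiers, and values are hypothetical. The sketch stops at the t statistic, which would then be compared against the critical value for the chosen significance level.

```python
from math import sqrt
from statistics import mean, variance

def extreme_groups(trait_by_mouse, fraction=0.25):
    """Return (low, high) lists of mouse ids from the bottom and top `fraction` of a trait."""
    ranked = sorted(trait_by_mouse, key=trait_by_mouse.get)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]

def pooled_t(a, b):
    """Student's two-sample t statistic with pooled variance; df = len(a) + len(b) - 2."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))

# Hypothetical fat pad masses and log10 expression ratios for one gene in 8 mice.
fat_pad = {"m1": 0.2, "m2": 0.3, "m3": 0.9, "m4": 1.0,
           "m5": 1.1, "m6": 1.2, "m7": 2.0, "m8": 2.2}
expr = {"m1": -0.30, "m2": -0.26, "m3": 0.00, "m4": 0.02,
        "m5": -0.01, "m6": 0.03, "m7": 0.28, "m8": 0.32}

low, high = extreme_groups(fat_pad)
t = pooled_t([expr[m] for m in high], [expr[m] for m in low])
# With 2 mice per group df = 2; the two-sided 0.05 critical value is 4.303.
discriminates = abs(t) > 4.303
```

In a real analysis the same test would be repeated for every gene, and a statistics library would normally supply the p-value directly.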

All genes depicted in Fig. 23 are either linked to the chromosome 2 locus
identified in Fig. 20, or are highly correlated with genes that are linked to the region. The
123 genes used in Fig. 23 are able to discriminate between mice with high fat pad masses
and those with low fat pad masses. Arrows 2302 highlight mice that have low fat pad
mass, but a high fat pad mass gene signature. Arrow 2304 highlights a single mouse that
has high fat pad mass, but a low fat pad mass gene signature.

Interestingly, a group of major urinary protein genes (MUP1, MUP4, and MUP5)
is linked to the chromosome 2 locus, in addition to 7 other loci (all with lod scores
exceeding 4.0), 4 of which co-localize with adiposity or fat pad mass traits. The MUP
genes stand out because they are highly correlated with many other genes known to be

involved in obesity-related pathways, including retinoid X receptor (RXR) gamma (R=
0.75/P-value << 1.0E"^^), acyl-Coenzyme A oxidase 1 (R=0.65/P-value = 3.78E"^^), and
leptin receptor (R=-0.74/P-value << 1.0E*^^), in addition to co-localizing with other genes
like peroxisome proliferator activated receptor (PPAR) gamma, RXR interacting protein
and LPR6, all known to be involved in these pathways. Mutations in the leptin receptor
in mice and man cause hyperphagia and extreme obesity. See, for example, Chen et al.,
1996, Cell 84, 491-495; Chua et al., 1996, Science 271, 994-996; Clement et al., 1998,
Nature 392, 398-401; Montague et al., 1997, Nature 387, 903-908; Strobel et al., 1998,
Nat. Genet. 18, 213-215; Tsigos et al., 2002, J. Pediatr. Endocrinol. Metab. 15, 241-253.
RXR is the obligate partner of many nuclear receptors, including PPARα and PPARγ, that
are involved in many aspects of the control of lipid metabolism, glucose tolerance and
insulin sensitivity. See Chawla et al., 2001, Science 294, 1866-1870. This demonstrates
that the chromosome 2 locus identified in Fig. 20 draws together adiposity, fat pad mass,
cholesterol and triglyceride levels and is linked to genes with proven roles in obesity and
diabetes. Further, the MUP genes are members of the lipocalin protein family and are


known to play a central role in pheromone-binding processes that affect mouse physiology
and behavior. See Timm et al., 2001, Protein Science 10, 997-1004. Furthermore, MUP
expression levels have been associated with variations in body weight, bone length, and
VLDL levels. See, for example, Metcalf et al., 2000, Nature 405, 1069-1073; Swift et al.,
2001, J. Lipid Res. 42, 218-224; Jiang and Zeng, 1995, Genetics 140, 1111-1127. Arrows
2306 in Fig. 23 indicate the positions of the MUP1, MUP2, and MUP3 genes.

The region supporting the chromosome 2 locus illustrated in Fig. 20 is
homologous to human chromosome 20q12-q13.12, a region that has previously been
linked to human obesity-related phenotypes. See Borecki et al., 1994, Obesity Research
2, 213-219; Lembertas et al., 1997, J. Clin. Invest. 100, 1240-1247. The human homologs
for genes NM_025575 (Fig. 20, curve 2012-4) and NM_015731 (Fig. 20, curve 2012-3)
also reside in the human chromosome 20 region and have not been completely
characterized; they have not been implicated in obesity-related traits before. While other
genes such as melanocortin 3 receptor (MC3R) have been suggested as possible
candidates for obesity at this locus (Lembertas et al., 1997, J. Clin. Invest. 100,
1240-1247), these data suggest that the genes NM_025575 (Fig. 20, curve 2012-4) and
NM_015731 (Fig. 20, curve 2012-3) may be responsible for the underlying QTL. Unlike
MC3R, these two genes are significantly linked to the murine chromosome 2 locus.
Further, they are significantly correlated with several of the fat pad mass traits, and they
are genetically interacting with several of the fat pad mass traits also linked to the
chromosome 2 locus. It is observed that expression levels for MC3R are not linked to the
chromosome 2 locus illustrated in Fig. 20, and there are no SNPs annotated in the exons
or introns of the gene between the C57/BL6 and DBA/2J strains in the most recent build
of the Celera RefSNP database. These observations provide evidence suggesting that
MC3R may not be the gene underlying the chromosome 2 linkage, at least in this particular
system. Of course, it is possible that MC3R is only expressed in the brain, and that
polymorphic expression of MC3R in the brain leads to changes of expression in the
liver. Because there are no DNA polymorphisms in this gene between the two strains that
lead to codon changes or that likely lead to cis-acting alternative splicing polymorphisms,
if it is the causative gene in this case, it would most likely have to be acting through
transcriptional regulation.




6.7.3. TESTING FOR PLEIOTROPY

In some embodiments, the inventive method disclosed in Fig. 19 is extended.
Tests developed by Jiang and Zeng (Genetics 140, 1111-1127, 1995) and implemented
by Drake et al. (Physiol. Genomics 5, 205-215, 2001) were applied to assess whether
pleiotropy of a common underlying gene, rather than close linkage of separate genes, was
responsible for the colocalized cQTL and eQTL in the chromosome 2 region. As set forth
by Jiang and Zeng (Genetics 140, 1111-1127, 1995), to test the hypothesis of pleiotropy
versus close linkage for two coincident QTL of interest, the multi-trait composite interval
mapping (CIM) (Lynch and Walsh, 1998, Genetics and Analysis of Quantitative Traits,
Sunderland, MA: Sinauer Associates) is reformulated. The hypotheses of interest (H0 and
H1), which involve the position p1 of the QTL having an effect on trait 1 and the position
p2 of the QTL having an effect on trait 2, are given by:

H0: p1 = p2
H1: p1 ≠ p2

The alternative hypothesis indicates that the QTL are nonpleiotropic and are
located at different map positions. The likelihood for H0 is the same as that given for the
multi-trait CIM model. However, the likelihood for the alternative is that developed by
Jiang and Zeng (Genetics 140, 1111-1127, 1995). Using the prescription set forth by
Jiang and Zeng, calculation of the maximum likelihoods for each hypothesis was carried
out using the expectation-conditional maximization (ECM) algorithm. Once the
maximum likelihoods under each hypothesis were computed, the log ratio of the
likelihoods was computed to serve as the test statistic. This log-likelihood ratio test
statistic is asymptotically distributed as a chi-square distribution with one degree of freedom.
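The final step of the test, turning two maximized log-likelihoods into a p-value, can be sketched as below. The sketch assumes natural-log likelihoods and the usual statistic -2 ln(L0/L1); it does not implement the ECM fitting itself, and the log-likelihood values are hypothetical. For one degree of freedom the chi-square upper tail equals erfc(sqrt(x/2)), which avoids any external dependency.

```python
from math import erfc, sqrt

def lrt_statistic(loglik_null, loglik_alt):
    """Likelihood ratio statistic -2 * ln(L0 / L1) from maximized log-likelihoods."""
    return -2.0 * (loglik_null - loglik_alt)

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with one degree of freedom."""
    return erfc(sqrt(x / 2.0)) if x > 0 else 1.0

# Hypothetical maximized log-likelihoods under H0 (pleiotropy) and H1 (close linkage).
stat = lrt_statistic(loglik_null=-512.4, loglik_alt=-511.9)
p_value = chi2_sf_1df(stat)
reject_pleiotropy = p_value < 0.05
```

A non-significant p-value, as in this made-up case, means the data give no reason to prefer two separate QTL positions over a single shared one.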

The test supported the hypothesis of pleiotropy (one allele affecting several traits)
in that no significant results for the traits subcutaneous fat pad mass, perimetrial fat pad
mass, omental fat pad mass, or adiposity at the 0.05 significance level were found. The
results obtained are consistent with pleiotropy of a common underlying gene regulating
the clinical and expression traits linked to the chromosome 2 locus. The four genes
detailed in Fig. 20 by curves 2012-1 through 2012-4 may be considered as primary
causative candidates for all of the linkage activity at the chromosome 2 locus.




The majority of genes linked to the chromosome 2 region are significantly
correlated among themselves, and functional patterns emerge from these data that support
the hypothesis that these genes are associated with the clinical traits linked to this region.
As an example, 186 of the 867 genes linked to this region have been assigned to GO
categories (The Gene Ontology Consortium, 2001, Genome Research 11, 1425-1433). Of
these, 39 have been assigned to the "ATP binding" molecular function category. With
4,771 genes having GO classifications and lod scores greater than 2.0, the "ATP binding"
category occurs in 514 of these genes. Fisher's Exact Test was used to determine if the
"ATP binding" category is more represented in the chromosome 2 QTL cluster than
would be expected by chance (p-value = 0.0000008). Such strong significance indicates
that the high occurrence of "ATP binding" in the cluster is very unlikely to have happened
by chance. Further, subsets within the 39 genes are highly correlated with genes known to
be associated with obesity related traits. These genes include Leptin receptor (correlation 
coefficient = 3.8E'^^) and RXR gamma (correlation coefficient 0.78 / pvalue = 3.8E'^^). 
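The enrichment test on the "ATP binding" category can be sketched as a one-sided Fisher's exact test, computed as the upper tail of a hypergeometric distribution. The sketch assumes that the 186 annotated cluster genes are a subset of the 4,771 annotated genes; the exact contingency table used for the reported p-value is not stated here, so the result of this sketch need not match it.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the annotation of interest."""
    total = comb(N, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return favorable / total

# Counts quoted in the text; the subset relationship is an assumption.
p = hypergeom_upper_tail(k=39, n=186, K=514, N=4771)
expected = 186 * 514 / 4771   # expected "ATP binding" genes in the cluster by chance
```

Under this assumption roughly 20 "ATP binding" genes would be expected by chance in the cluster, against the 39 observed.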


6.7.4. DETERMINING THE TOPOLOGY OF A BIOLOGICAL PATHWAY THAT AFFECTS A COMPLEX TRAIT

The processing steps disclosed in Fig. 19 and described in Section 5.16, above, are
used to identify the genes associated with a complex trait (e.g., the genes that affect a
complex trait). This section describes how the data obtained in Section 5.16, above, can
also be used to deduce the topology of a biological pathway that affects a complex trait.
In particular, using Fig. 24 as an illustration, cQTL and eQTL data are analyzed in order to
deduce the topology of such a biological pathway.

In step 1912, the cQTL for clinical traits 1 through 4 are localized on a
representative molecular map 2402 for the population under study. For example, in cases
where the population under study is human, representative molecular map 2402 is, for
example, a map of the human genome. In some embodiments, molecular map 2402 (Fig.
24) is a marker map, such as one stored as marker data 70 in system 10 (Fig. 1). In some
embodiments, molecular map 2402 includes the nucleotide sequence of a portion of the
genome (e.g., genomic map) of the population under study.

Step 1912 of Fig. 24 (illustrated as a downward arrow in the upper left side of Fig.
24) corresponds to step 1912 of Fig. 19. In step 1912 of Fig. 24, a clinical quantitative
trait locus (cQTL) that is linked to a clinical trait T is identified on map 2402 with a QTL
analysis that uses the phenotypic statistic set 2102 as the clinical trait T. In some


embodiments, these QTL analyses are performed by an embodiment of clinical
quantitative trait (cQTL) identification module 2204 (Fig. 22). Referring to Fig. 24, four
phenotypic statistics sets 2102 are shown. Each set 2102 corresponds to one of four
clinical traits under study. It will be appreciated that any number of clinical traits may be
analyzed and that the four traits illustrated in Fig. 24 are merely exemplary. For example,
at least 3, 5, 8, 12, 20, 30, or 40 clinical traits could be analyzed using the methods
disclosed in Fig. 24.

In one embodiment, the complex trait under study is obesity. In one example of
this embodiment, clinical trait 1 is body mass index (e.g., weight / height^2), clinical trait
2 is subcutaneous fat pad mass, clinical trait 3 is insulin level in the blood, and clinical
trait 4 is leptin levels. Accordingly, cQTL1 is a QTL that is linked to body mass index,
cQTL2 is a QTL that is linked to subcutaneous fat pad mass, cQTL3 is a QTL that
is linked to insulin level in the blood, and cQTL4 is a QTL that is linked to leptin levels.
Further, cQTL1 through cQTL4 are determined using the QTL analysis of step 1912 (Fig.
19) as described in detail in Section 5.16, above.

In addition to the identification of four cQTL in map 2402, which respectively
correspond to four clinical traits associated with obesity, Fig. 24 discloses the results of a
number of eQTL analyses. The computation of these eQTL analyses will now be
described. In Fig. 24, four expression statistics sets 304 (Fig. 3) are illustrated. Each
expression statistic set corresponds to a different gene G in the genome of the population
under study. As described in detail in previous sections, each expression value in the
expression statistic set is a measurement of a cellular constituent corresponding to a
particular gene G in an organism in a population of organisms under study. The cellular
constituent may be, for example, mRNA levels for the corresponding gene, protein levels
for the corresponding gene, or a metabolite level that is directly regulated by the
corresponding gene. It will be appreciated that any number of genes may be analyzed and
that the four genes illustrated in Fig. 24 are merely exemplary. For example, at least 3, 5,
8, 12, 20, 30, or 40 genes could be analyzed using the methods disclosed in Fig. 24.

Each expression statistic set 304 is used as the quantitative trait in a QTL analysis
in accordance with processing step 1910 (Fig. 19). QTL analyses, such as those
performed in processing step 1910, are described in detail in Section 5.16, above. A
separate QTL analysis is performed for each of the four expression statistics sets 304
illustrated in Fig. 24.


In some embodiments, these QTL analyses are performed by an embodiment of
expression quantitative trait loci (eQTL) identification module 2202 (Fig. 22). Each
expression statistic set 304 generates eQTL that are linked to the expression statistic set.
Expression statistic set 304-Gene1, which is the expression statistic set for gene 1, yields
four eQTL (eQTL1-1*, eQTL1-2, eQTL1-3, and eQTL1-4). These four eQTL map to
four different locations on map 2402. It will be appreciated that eQTL will map to
various locations on map 2402 and that not all eQTL will colocalize with a cQTL.
However, for ease of illustration in this example, eQTL1-1*, eQTL1-2, eQTL1-3, and
eQTL1-4 respectively colocalize with cQTL1, cQTL2, cQTL3, and cQTL4. Only one of
the eQTL can correspond to the physical location of the gene G that forms the basis of the
expression statistic set 304 that was used to compute the eQTL. For set 304-Gene1, the
eQTL denoted eQTL1-1* maps to the physical location of gene 1 in map 2402. For this
reason, eQTL1-1 is marked with an asterisk. For the set 304-Gene4, the eQTL denoted
eQTL4-1* maps to the physical location of gene 4.

The physical location of each eQTL for each of genes 1 through 4 is shown in Fig.
24. Analysis of the eQTL and the cQTL allows for the determination of which of the four
genes is the furthest upstream in a biological pathway that affects the complex trait T
under study. Fig. 24 discloses the eQTL/cQTL relationships that are summarized in Table
5 below.

Table 5

Gene Number   cQTL that colocalize with      Physical Location of the Gene
              an eQTL for this Gene          (expressed in terms of cQTL and
                                             eQTL that colocalize to the location
                                             on map 2402)

1             cQTL1, cQTL2, cQTL3, cQTL4     cQTL1/eQTL1-1
2             cQTL2, cQTL3, cQTL4            cQTL2/eQTL2-1
3             cQTL3, cQTL4                   cQTL3/eQTL3-1
4             cQTL4                          cQTL4/eQTL4-1



Referring to Fig. 24, it is seen that cQTL4 colocalizes with an eQTL for each of
the four genes under study. In some embodiments, an eQTL and a cQTL are considered
colocalized if they fall within about 25 centiMorgans (cM) of each other on map 2402. In
some embodiments, an eQTL and cQTL are considered colocalized if they fall within
about 10 cM, about 5 cM, about 1 cM, about 0.5 cM, or about 0.1 cM of each other on




map 2402. In some embodiments, an eQTL and cQTL are considered colocalized if they
fall within about 100 kilobases, 50 kilobases, 25 kilobases, 10 kilobases, 1000 bases, or
500 bases of each other on map 2402. None of the other cQTL colocalize with an eQTL
for each of the four genes under study. For example, cQTL2 only colocalizes with an
eQTL for two of the genes under study, gene 1 and gene 2. The data shown in Fig. 24
suggest that a gene at position cQTL4 in map 2402 is at the furthest upstream position in a
biological pathway. The observation that the eQTL for gene 4 only colocalizes with
cQTL4 and none of the other cQTL suggests that the identity of the upstream gene in a
biological pathway affecting obesity is, in fact, gene 4.
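The colocalization criterion described above reduces to a distance check on the molecular map. The following sketch uses a genetic-distance threshold in cM; representing a locus as a (chromosome, position) pair and the function name `colocalized` are assumptions made for illustration, and the positions are hypothetical.

```python
def colocalized(eqtl, cqtl, max_cm=25.0):
    """True when two loci lie on the same chromosome within max_cm centiMorgans.

    Each locus is a (chromosome, position_cM) pair.
    """
    return eqtl[0] == cqtl[0] and abs(eqtl[1] - cqtl[1]) <= max_cm

# Hypothetical map positions.
cqtl4 = (2, 105.0)
eqtl_near = (2, 101.0)
eqtl_far = (2, 60.0)
eqtl_other_chrom = (7, 105.0)

near_at_25 = colocalized(eqtl_near, cqtl4)              # within 25 cM
near_at_1 = colocalized(eqtl_near, cqtl4, max_cm=1.0)   # a stricter embodiment
```

Tightening `max_cm` corresponds to the narrower embodiments listed in the text; a physical-distance variant would compare base-pair coordinates instead.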

Fig. 24 further suggests which gene comes after gene 4 in a biological pathway that
affects obesity. cQTL3 colocalizes with three eQTL: eQTL1-3, eQTL2-2, and eQTL3-1*.
These eQTL are respectively linked with gene 1, gene 2, and gene 3. This suggests
that there exists a gene that colocalizes with cQTL3 that affects at least two other genes.
It is noted that the physical location of gene 3 is cQTL3. Further, the only other eQTL
linked to gene 3 that colocalizes with a cQTL on map 2402 is eQTL3-2. But eQTL3-2
colocalizes with cQTL4, a position that has already been determined to colocalize with
the most upstream gene in the pathway identified by the data in Fig. 24. Thus, taken
together, the data suggest that gene 3 is downstream from gene 4 in a biological pathway
that affects obesity. The data further suggest that gene 3 is upstream from genes 1 and 2.

Analysis of the data is completed upon consideration of the eQTL colocalized to
cQTL1 and cQTL2. Taken together, the data illustrated in Fig. 24 suggest the following
topology for a biological pathway:

Gene 4 → Gene 3 → Gene 2 → Gene 1
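The reasoning that leads from Table 5 to this ordering can be sketched in a few lines. The heuristic below is one way to encode it (an assumption, not necessarily how determination module 2206 works): a gene is treated as upstream of every other gene that has an eQTL at the first gene's physical location, and genes are sorted by how many genes they sit upstream of. The identifiers mirror Table 5.

```python
# cQTL that colocalize with an eQTL of each gene (Table 5, second column).
colocalizing = {
    "gene1": {"cQTL1", "cQTL2", "cQTL3", "cQTL4"},
    "gene2": {"cQTL2", "cQTL3", "cQTL4"},
    "gene3": {"cQTL3", "cQTL4"},
    "gene4": {"cQTL4"},
}
# cQTL at which each gene physically resides (Table 5, third column).
resides_at = {"gene1": "cQTL1", "gene2": "cQTL2", "gene3": "cQTL3", "gene4": "cQTL4"}

def order_pathway(colocalizing, resides_at):
    """Order genes from most upstream to most downstream."""
    downstream = {
        g: {h for h in colocalizing if h != g and resides_at[g] in colocalizing[h]}
        for g in colocalizing
    }
    return sorted(colocalizing, key=lambda g: len(downstream[g]), reverse=True)

order = order_pathway(colocalizing, resides_at)
```

For the Table 5 data this yields gene 4, gene 3, gene 2, gene 1. Real data would rarely be this clean, so ties and conflicting evidence would need additional handling.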

In some embodiments, the analysis of data such as that disclosed in Fig. 24 is
performed by an embodiment of determination module 2206 (Fig. 22). The biological
pathway deduced in this example can be validated using techniques such as multivariate
analysis. In addition, the biological pathway deduced in this example can be validated
using techniques such as gene knock out studies. Those of skill in the art will recognize
numerous other methods for validating the proposed topology for the biological pathway
affecting the complex trait, and all such methods are within the scope of the present
invention.


While the complex trait analyzed in this hypothetical example is obesity, it will be 
appreciated that the techniques disclosed in this section can be used to help determine the 


topology of biological pathways that affect any complex trait of interest. Such determinations
are facilitated by choosing to analyze clinical traits that are affected or influenced by
the complex trait (e.g., complex disease) under study.

The example in this section can be described as a method for determining the
topology of a biological pathway that affects a complex trait. The method has the step of
(a), identifying one or more expression quantitative trait loci (eQTL) for a gene in a
plurality of genes using a first quantitative trait loci (QTL) analysis. This first QTL
analysis uses a plurality of expression statistics for the gene as a quantitative trait. Each
expression statistic in the plurality of expression statistics represents an expression value
for the gene in an organism in a plurality of organisms of a single species. The method
further comprises the step of (b), repeating step (a) a first number of times, wherein each
repetition of step (a) uses a different gene in the plurality of genes. In some
embodiments, step (a) is repeated three or more times. In some embodiments, step (a) is
repeated 5 or more times, 8 or more times, 12 or more times, 20 or more times, or 100 or
more times. At least some of the genes selected in iterations of step (a) are in the
biological pathway that affects a complex trait. An advantage of the present invention is
that genes that are not in the biological pathway can be selected in step (a) without failure
of the method, provided that some of the genes selected in iterations of step (a) are in the
pathway.

The method further comprises the step of (c), identifying a clinical quantitative
trait locus (cQTL) that is linked to a clinical trait in a plurality of clinical traits using a
second QTL analysis. The second QTL analysis uses a plurality of phenotypic values as a
quantitative trait. Each phenotypic value in the plurality of phenotypic values represents
a phenotypic value for the clinical trait in the plurality of clinical traits in an organism in
the plurality of organisms. The method further comprises the step of (d), repeating step
(c) a second number of times. Each repetition of step (c) uses a different clinical trait in a
plurality of clinical traits. In some embodiments, step (c) is repeated three or more times.
In some embodiments, step (c) is repeated five or more times, eight or more times, twelve
or more times, twenty or more times, or one hundred or more times. An advantage of the
present invention is that clinical traits that are not in fact associated with the complex trait
of interest may be selected in instances of step (c) without failure of the method, provided
that some of the clinical traits selected in iterations of step (c) are in fact indicative of
(associated with) the complex trait.




Finally, the method comprises the step of (e), using (i) the identity of each eQTL,
identified in an iteration of step (a), that colocalizes with a cQTL, identified in an
iteration of step (c), and (ii) a physical location of each gene in the plurality of genes on a
molecular map for the single species, in order to determine the topology of the biological
pathway that affects the complex trait. In one embodiment, step (e) is performed by
identifying a first eQTL. In general, this first eQTL has the property of colocalizing with
a first cQTL identified in step (c). Furthermore, this first eQTL has the property that the
gene used to generate the eQTL colocalizes with the physical location of the first cQTL.
In the case where each eQTL identified in step (a) colocalizes with more than one cQTL,
then preferably an eQTL that colocalizes with the smallest number of cQTL (among the
eQTL identified in step (a)) is identified. In such instances, the cQTL in the small number
of cQTL that actually colocalizes with the gene used to generate the first eQTL is denoted
as the first cQTL. Once the first cQTL has been identified, a determination is made as to
whether eQTL from other genes in the plurality of genes also colocalize with the first
cQTL. When this is the case, the hypothesis is drawn that the gene used to generate the
first eQTL is further upstream in a biological pathway affecting a complex trait than each
of the genes that generate eQTL colocalizing with the first cQTL. This gene is therefore
designated as the first gene. When this is not the case, a different first eQTL is identified
using the method described above.

The method continues by examining each of the genes that generate eQTL that
colocalize with the first cQTL in order to determine their topological order in a biological
pathway. This analysis proceeds in the same manner used to identify the first cQTL. For
example, a second gene that generates an eQTL that colocalizes with both the first cQTL
and a second cQTL is sought. If the physical location of the second gene colocalizes with
the second cQTL, then the second gene is considered a downstream candidate in the
biological pathway. If the second gene does not colocalize with the second cQTL, then a
different second gene is identified or step (e) can recommence. Various checks can be
performed on the second gene. First, a determination can be made as to whether eQTL
from other genes also colocalize with the second cQTL and, if so, whether they are the
same genes that generated eQTL that colocalize with the first cQTL. In cases where the
same genes are generating eQTL that colocalize with both the first cQTL and the second
cQTL, the suggestion is raised that such genes are downstream members of a biological
pathway that starts with the first gene and continues with the second gene. Each of these
downstream genes can be further examined using the same techniques used to identify the

296 



wo 2004/061616 PCT/US2003/041613 

first and second genes, in order to further descnbe tiie topology ot tne dioiogicai pamway 
that affects a coiiq)lex trait. 
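
By way of illustration only, the selection of the first gene described above can be sketched
in Python. This is a minimal sketch and not the claimed implementation: the Gene record, the
window_cm tolerance used to decide colocalization, and the function names are assumptions
introduced here, and any positions passed to it are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Locus = Tuple[str, float]  # (chromosome, position in cM); an assumed representation


@dataclass
class Gene:
    name: str
    location: Locus      # physical location of the gene on the molecular map
    eqtls: List[Locus]   # loci linked to the expression of this gene


def colocalize(a: Locus, b: Locus, window_cm: float = 2.0) -> bool:
    """Two loci colocalize if they share a chromosome and lie within window_cm."""
    return a[0] == b[0] and abs(a[1] - b[1]) <= window_cm


def find_first_gene(genes: List[Gene], cqtls: List[Locus],
                    window_cm: float = 2.0) -> Optional[tuple]:
    """Return (gene name, first cQTL, downstream gene names), or None.

    A candidate first gene has an eQTL at a cQTL where the gene itself resides,
    and other genes must also have eQTL at that cQTL. Among candidates, the one
    whose eQTL colocalize with the fewest cQTL is preferred.
    """
    best = None
    for gene in genes:
        hit = [c for c in cqtls
               if any(colocalize(e, c, window_cm) for e in gene.eqtls)]
        cis = [c for c in hit if colocalize(gene.location, c, window_cm)]
        if not cis:
            continue
        first_cqtl = cis[0]
        downstream = [g.name for g in genes if g is not gene
                      and any(colocalize(e, first_cqtl, window_cm) for e in g.eqtls)]
        if not downstream:
            continue  # no other gene maps here, so try a different first eQTL
        if best is None or len(hit) < best[0]:
            best = (len(hit), gene.name, first_cqtl, downstream)
    return None if best is None else best[1:]
```

The same function can be applied again to the downstream genes and the remaining cQTL to
propose a second gene, mirroring the iterative procedure in the text.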






6.7.5. ASSOCIATING GENES WITH TRAITS USING CROSS SPECIES DATA 
The present section provides an example of how the systems and methods of the
present invention can be used to associate genes with traits using cross species data. This
example builds upon the discovery of the four murine candidate genes identified in
Section 6.7.1, above. In Section 6.7.1, four genes were discovered on mouse
chromosome number 2 by co-localizing cQTL for the obesity related traits (1)
subcutaneous fat pad mass (Fig. 20, curve 2002), (2) perimetrial fat pad mass (Fig. 20,
curve 2004), (3) omental fat pad mass (Fig. 20, curve 2006), and (4) adiposity (Fig. 20,
curve 2008) with four eQTL with lod scores greater than 3.0 that correspond to genes
whose physical locations are within the vicinity (e.g., 2 cM) of the four cQTL. The four
mouse genes are (1) RIKEN cDNA 2610042O14 (NM_025575) (Fig. 20, curve 2012-4),
(2) ATPase, class II, type 9A (NM_015731) (Fig. 20, curve 2012-3), (3) RIKEN
cDNA 2610100K07 (NM_025996) (Fig. 20, curve 2012-2), and (4) zinc finger protein 64
(NM_009564) (Fig. 20, curve 2012-1).

The region of mouse chromosome 2 in which curves 2002 through 2012 are found
is homologous to human chromosome region 20q12-q13.12. This region of the human
genome has previously been linked to human obesity-related phenotypes. See, for
example, Borecki et al., 1994, Obesity Research 2, 213-219 and Lembertas et al., 1997, J.
Clin. Investigation 100, 1240-7. The data described in Section 6.7.1 strongly suggest
that the human genes in human chromosome region 20q12-q13.12 that correspond to the
mouse genes are associated with obesity. Therefore, following the methods and systems
of the present invention, the human genes that correspond to the four mouse genes
identified in Section 6.7.1, above, were characterized. A summary of this
characterization is provided in Tables 2 and 3 below. In Table 6, the nucleotide
information for the four mouse genes and the four corresponding human genes is
provided. In Table 4, the protein products of the four mouse genes and the four
corresponding human genes are provided.






TABLE 6 - Obesity related genes of the present invention

Mouse Sequence              Curve Number  Mouse Gene        Human Sequence                Human Gene
                            in Figure 20  Name                                            Name

NM_025575 (SEQ ID NO: 1)    2012-4        2610042O14 gene   Corrected form of AL591714    Not named
                                                            (SEQ ID NO: 2); coding region
                                                            only (SEQ ID NO: 3)

NM_015731                   2012-3        ATP9A             Obtained by alignment of      ATP9A
                                                            protein O75110 to human
                                                            chromosome 20 (SEQ ID NO: 12)

NM_025996 (SEQ ID NO: 13)   2012-2        Not named         NM_006809 (SEQ ID NO: 16)     Tomm34, or Tom 34

NM_009564 (SEQ ID NO: 19)   2012-1        Zfp64             NM_018197 (SEQ ID NO: 20)     Zfp64

TABLE 4 - Obesity related gene products of the present invention

Mouse Sequence              Curve Number  Mouse Protein           Human Sequence                Human Protein
                            in Figure 20  Name                                                  Name

NP_079851 (SEQ ID NO: 29)   2012-4        Not named               Translated from SEQ ID NO: 2  Unknown
Q9CQK0 (SEQ ID NO: 4)                                             (SEQ ID NO: 8)
Q9CYM5 (SEQ ID NO: 5)
Q9CYX5 (SEQ ID NO: 6)
Q9DAU8 (SEQ ID NO: 7)
Q9CVJ3 (SEQ ID NO: 28)

NP_056546 (SEQ ID NO: 10)   2012-3        ATPase 9A, class II     O75110 (SEQ ID NO: 11)        Phospholipid-transporting
                                                                                                ATPase IIA

NP_080272 (SEQ ID NO: 14)   2012-2        Not named               NP_006800 (SEQ ID NO: 15)     translocase of outer
TR_Q9CYG7 (SEQ ID NO: 17)                                         TR_Q15785 (SEQ ID NO: 18)     mitochondrial membrane 34

NP_033590 (SEQ ID NO: 21)   2012-1        Zinc finger protein 64  NP_060667 (SEQ ID NO: 25)     zinc finger protein 64
TR_P97365 (SEQ ID NO: 22)                                         TR_Q9NPA5 (SEQ ID NO: 26)     homolog (mouse)
TR_Q99KE8 (SEQ ID NO: 23)                                         TR_Q9NTS7 (SEQ ID NO: 27)
TR_Q9CWR3 (SEQ ID NO: 24)


6.7.5.1. NM_025575 / NP_079851

The nucleotide sequence for the Mus musculus gene NM_025575 (Fig. 20, curve
2012-4) is provided in Fig. 27 (SEQ ID NO: 1). There is no human gene given in
LocusLink for this mouse sequence (step 2504-No, Fig. 25). A BlastN search (step 2506,
Fig. 25) indicates that AL591714.1 is the best candidate human mRNA for the mouse
sequence NM_025575 (expectation value of 0.0). The human protein product of
AL591714.1 is CAC39448.1. A BlastP search using the translated amino acid sequence
of the mouse sequence NM_025575 (NP_079851) identifies the human protein
CAC39448.1 as the second best hit (expectation score 9e'*^) (step 2508, Fig. 25). A
pairwise BlastP search between the human protein CAC39448 and the mouse protein,
NP_079851, and a translated Blast search of the human RNA nucleotide database
sequences against NP_079851 indicate the presence of a frame-shift in the mRNA
database sequence AL591714. In the BlastP search, only the first 42 amino acids of
NP_079851 and CAC39448 match. A TblastN search of the human AL591714.1 mRNA
with the query mouse protein, NP_079851, covers the entire length of the mouse protein,
but in two fragments. Taken together, these data indicate that a frameshift occurs at
position 241 of AL591714.1. Therefore, the correct sequence for the mRNA that encodes
a protein that is a human analog of the mouse sequence NM_025575 is illustrated in
Fig. 28 (SEQ ID NO: 2). This corrected sequence removes two nucleotides that are
present in the AL591714 database sequence. In particular, the portion of SEQ ID NO: 2
that codes for the human protein that corresponds to the mouse protein NP_079851 is
from 115 nts to 569 nts of SEQ ID NO: 3. This sequence is shown in Fig. 29 (SEQ ID NO:
3). A BlastX search of SEQ ID NO: 3 against mouse proteins yields the protein
NP_079851 as the best match, establishing that SEQ ID NO: 3 is the best homolog for the
mouse protein NP_079851.
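
The effect of a two-nucleotide insertion on the reading frame, and its repair by deletion,
can be illustrated with a short Python sketch. The sequences used with it are invented toy
strings and are not SEQ ID NO: 1-3 or AL591714; the function names translate and
delete_bases are assumptions made for this illustration, and only the standard genetic code
is taken as given.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq: str) -> str:
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def delete_bases(seq: str, start: int, count: int) -> str:
    """Remove `count` bases beginning at 0-based index `start`."""
    return seq[:start] + seq[start + count:]


# Toy example (hypothetical sequences): two spurious bases after the second codon.
reference = "ATGGCTGAAACCTGA"
database_entry = "ATGGCT" + "GG" + "GAAACCTGA"
corrected = delete_bases(database_entry, 6, 2)
```

In the toy example the uncorrected entry agrees with the reference protein only up to the
insertion point, which parallels the observation above that only the first 42 amino acids
match before the frameshift.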

There are five complete entries in the TrEMBL database (Bairoch and Apweiler,
2000, "The SWISS-PROT protein sequence database and its supplement TrEMBL in
2000," Nucleic Acids Res. 28, 45-48) for the Mus musculus amino acid sequence that
corresponds to the 2610042O14Rik gene (SEQ ID NO: 1). The TrEMBL accession
numbers for these five proteins are Q9CQK0 (Fig. 30A, SEQ ID NO: 4), Q9CYM5 (Fig.
30B, SEQ ID NO: 5), Q9CYX5 (Fig. 30C, SEQ ID NO: 6), Q9DAU8 (Fig. 30D, SEQ ID
NO: 7), and Q9CVJ3 (Fig. 30E, SEQ ID NO: 28). The human amino acid sequence that
corresponds to the corrected form of accession number AL591714 (SEQ ID NO: 2) is
provided in Fig. 31 (SEQ ID NO: 8). A BlastP search with SEQ ID NO: 8 against the
mouse protein database gives the mouse protein NP_079851 as the best match (step 2514,
Fig. 25), establishing that SEQ ID NO: 8 is the human analog of the mouse protein
NP_079851. The sequence for the mouse protein NP_079851 is provided in Fig. 30F
(SEQ ID NO: 29).

6.7.5.2. NM_015731 / NP_056546 
The nucleotide sequence for the Mus musculus gene NM_015731 (Fig. 20, curve
2012-3) is provided in Fig. 32 (SEQ ID NO: 9). The mouse gene is characterized as
ATP9A in Genbank LocusLink. An alternate gene symbol used for this gene is
KIAA0611.

There are two records in the human relationships field of the LocusLink record
(Fig. 33). Both of these records lead to the same LocusLink record. The cytogenetic
location of the gene in the human LocusLink record is 20q13.11-13.2. There are several
nucleotide and protein sequences associated with human chromosome position 20q13.11-
13.2 (Fig. 34). A BLAST search of the mouse refseq protein against the two human
ATP9A proteins in the LocusLink record (namely, BAA31586 and AAH16044) was
performed. No significant homology was found between the mouse refseq protein
(NP_056546) and human AAH16044. Homology was found between the mouse refseq
protein (NP_056546) and human BAA31586 (not shown). A BlastN search using the
four human sequences in the LocusLink record for ATP9A (Fig. 34, AB014511.1,
AK025559, AK026513, and BC016044) against the mouse refseq sequence indicates that
only AB014511.1 is similar to NM_015731. The other three sequences are not similar to
NM_015731.

The amino acid sequence that corresponds to the Mus musculus gene NM_015731
is NP_056546. NP_056546 is provided in Fig. 35 (SEQ ID NO: 10). A BlastP search
using the mouse refseq protein (NP_056546) gives the human protein O75110 as the best
candidate ortholog, which is provided in Fig. 36 (SEQ ID NO: 11).

A BlastN search of the human nucleotide database with the mouse RefSeq mRNA
was performed. This search identified the AB014511.1 sequence identified in the
LocusLink record described above. AB014511.1 corresponds to the human protein
BAA31586, thus confirming that this protein is the ortholog of the Mus musculus gene
NM_015731. However, this is only a partial mRNA sequence.

The best possible mRNA sequence for the protein O75110 can only be obtained by
genomic alignment of the protein against the human genomic sequence (NCBI assembly,
build 30, June 2002). The human chromosome map location of the protein BAA31586,
which comes from LocusLink (Fig. 34), and O75110, which comes from the alignment of
the mouse protein NP_056546 (SEQ ID NO: 10) against the human genome NCBI
assembly, are both at position 20q13.2. In summary, the protein O75110 (SEQ ID NO:
11) is the best human ortholog for the mouse protein NP_056546 (SEQ ID NO: 10).
However, its full mRNA is not available. The best existing mRNA sequence
corresponding to the mouse protein is AB014511.1 (whose protein product is
BAA31586). AB014511.1 is a partial mRNA sequence. The full mRNA sequence for
the protein O75110 may be inferred from a genomic alignment of the protein against the
human genome assembly. The inferred mRNA sequence is given in Fig. 37 (SEQ ID
NO: 12).



6.7.5.3. NM_025996 / NP_080272
The nucleotide sequence for the Mus musculus gene NM_025996 (Fig. 20, curve
2012-2) is provided in Fig. 38 (SEQ ID NO: 13). The LocusLink record for NM_025996
is provided in Figs. 39A and 39B. The LocusLink record indicates that the mouse protein
that corresponds to NM_025996 (SEQ ID NO: 13) is NP_080272. NP_080272 is
provided in Fig. 40 (SEQ ID NO: 14).

A BlastP search of human proteins using the mouse protein NP_080272 (SEQ ID
NO: 14) yields the human protein NP_006800 as the best hit. The human protein
NP_006800 is provided in Fig. 41 (SEQ ID NO: 15). NP_006800 is also implied by
LocusLink (Fig. 38B). A BlastN search of human nucleotide sequences with the mouse
refseq mRNA (NM_025996, SEQ ID NO: 13) gives the human refseq, NM_006809.2
(Fig. 42, SEQ ID NO: 16) as the best hit. This human refseq sequence is also implied in
the human-homology relationship information in LocusLink (Fig. 38B).

A BlastP search of mouse proteins with the human protein NP_006800 (SEQ ID
NO: 15) yields the mouse protein NP_080272 (SEQ ID NO: 14) as the best hit. A BlastN
search of mouse nucleotide sequences with the human mRNA NM_006809 (SEQ ID NO:
16) yields the mouse mRNA, NM_025996 (SEQ ID NO: 13) as the best hit. Therefore,
based on LocusLink relationships, and BlastP and BlastN searches performed in
accordance with Fig. 25, the human orthologs for the mouse sequences NM_025996
(SEQ ID NO: 13) / NP_080272 (SEQ ID NO: 14) are NM_006809 (SEQ ID NO: 16) and
NP_006800 (SEQ ID NO: 15).
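
The forward and reverse searches described above amount to a reciprocal best hit test, which
may be sketched as follows. This sketch assumes that the Blast results have already been
parsed into dictionaries mapping each query to a list of (subject, expectation value) pairs;
that data layout and the function names are assumptions made here and are not part of the
Blast programs or of Fig. 25.

```python
from typing import Dict, List, Optional, Tuple

Hits = Dict[str, List[Tuple[str, float]]]  # query -> [(subject, e-value), ...]


def best_hit(hits: List[Tuple[str, float]]) -> Optional[str]:
    """Return the subject with the lowest expectation value, or None if no hits."""
    if not hits:
        return None
    return min(hits, key=lambda hit: hit[1])[0]


def is_reciprocal_best_hit(query: str, forward: Hits, reverse: Hits) -> bool:
    """True when the best hit of `query` has `query` as its own best hit."""
    subject = best_hit(forward.get(query, []))
    if subject is None:
        return False
    return best_hit(reverse.get(subject, [])) == query
```

A pair that passes this test in both the protein (BlastP) and nucleotide (BlastN) searches is
treated in the text above as an ortholog pair.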

In addition to the NP_080272 (SEQ ID NO: 14) entry for the Mus musculus
protein sequence, the TrEMBL database includes the entry TR_Q9CYG7 that
corresponds to the Mus musculus nucleotide sequence NM_025996 (SEQ ID NO: 13). The
Mus musculus protein sequence TR_Q9CYG7 is provided in Fig. 43 (SEQ ID NO: 17).
In addition to the NP_006800 (SEQ ID NO: 15) entry, the TrEMBL database includes the
human protein TR_Q15785 that corresponds to the human nucleotide sequence
NM_006809 (SEQ ID NO: 16). The entry TR_Q15785 is provided in Fig. 44 (SEQ ID
NO: 18).






6.7.5.4. NM_009564 / NP_033590 
The nucleotide sequence for the Mus musculus gene NM_009564 (Fig. 20, curve
2012-1) is provided in Fig. 45 (SEQ ID NO: 19). The human nucleotide sequence that
corresponds to NM_009564 is NM_018197.1. NM_018197.1 is provided in Fig. 46
(SEQ ID NO: 20). The Mus musculus amino acid sequence that corresponds to the Mus
musculus gene NM_009564 is NP_033590. NP_033590 is provided in Fig. 47 (SEQ ID
NO: 21). In addition to the NP_033590 entry, the TrEMBL database includes three
amino acid sequences that correspond to the Mus musculus gene NM_009564. They are
TR_P97365 (Fig. 48, SEQ ID NO: 22), TR_Q99KE8 (Fig. 49, SEQ ID NO: 23), and
TR_Q9CWR3 (Fig. 50, SEQ ID NO: 24).

The amino acid sequence that corresponds to the human nucleotide sequence
NM_018197 (SEQ ID NO: 20) is NP_060667. NP_060667 is provided in Fig. 51 (SEQ
ID NO: 25). The human protein NP_060667 (SEQ ID NO: 25) and the mRNA
NM_018197 (SEQ ID NO: 20) are indicated as the human homologs to the mouse
sequences NM_009564 (SEQ ID NO: 19) / NP_033590 (SEQ ID NO: 21) by LocusLink.
Furthermore, a BlastP search of human proteins with the mouse protein NP_033590 (SEQ
ID NO: 21) yields the human protein NP_060667 (SEQ ID NO: 25) as the best hit.
Further, a BlastP search of mouse proteins with the human protein NP_060667 (SEQ ID
NO: 25) yields the mouse protein NP_033590 (SEQ ID NO: 21) as the best mouse
sequence. A BlastN search of human nucleotides with mouse NM_009564 (SEQ ID NO:
19) yields NM_018197 (SEQ ID NO: 20) as the best hit. Again, a BlastN search of
mouse nucleotides with human NM_018197 (SEQ ID NO: 20) gives the mouse sequence
NM_009564 (SEQ ID NO: 19) as the best hit. Therefore, from the Blast results and the
LocusLink homology information, NM_018197 (SEQ ID NO: 20) and NP_060667 (SEQ
ID NO: 25) are the human orthologs of the mouse sequences NM_009564 (SEQ ID NO:
19) / NP_033590 (SEQ ID NO: 21).

In addition to the NP_060667 (SEQ ID NO: 25) entry, the TrEMBL database
includes the entries TR_Q9NPA5 and TR_Q9NTS7 that correspond to the human
nucleotide sequence NM_018197 (SEQ ID NO: 20). The human amino acid sequence
TR_Q9NPA5 is provided in Fig. 52 (SEQ ID NO: 26). The human amino acid sequence
TR_Q9NTS7 is provided in Fig. 53 (SEQ ID NO: 27).



6.8. TARGET VALIDATION USING CROSS-SPECIES DATA

The utility of the cross species approach described in Section 5.19 for elucidating
complex diseases and directly identifying targets for complex diseases is demonstrated by
examining the pattern of eQTL linking to one of the major obesity loci identified in the
BXD cross described by Schadt et al. (Nature, 2003, 422, pp. 297-302). In this cross, F2
mice were constructed from two standard inbred strains, C57BL/6J and DBA/2J. Figures
86A-86D depict cQTL for several clinical traits in the BXD cross (Fig. 86A, plasma
insulin levels; Fig. 86B, epididymal fat pad mass; Fig. 86C, plasma leptin levels; Fig. 86D,
HDL levels) located on murine chromosome 13. The lod score curves for these traits
support two cQTL at genetic positions 85 cM and 110 cM on mouse chromosome 13.

Liver tissues from 111 F2 mice constructed from strains C57BL/6J and DBA/2J
were profiled using a mouse gene oligonucleotide array. The expression values from
these experiments were treated as quantitative traits and carried through a linkage analysis
using evenly spaced markers across the autosomal chromosomes. Of the 23,574 genes
represented on the microarray, 7,861 were detected as significantly differentially
expressed (type I error = 0.05) in the parental strains or in at least ten percent of the F2
mice profiled. Using standard interval mapping techniques, quantitative trait loci (QTL)
with logs of the odds ratio (LOD) scores greater than 4.3 (P-value < 0.00005) were
identified for 2,123 genes, with a maximum LOD score of 80.0.
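
The correspondence between a LOD score of 4.3 and a P-value of roughly 0.00005 can be
reproduced under one common assumption, which the text above does not state: that the
likelihood ratio statistic 2 ln(10) x LOD is referred to a chi-square distribution with two
degrees of freedom (for example, additive and dominance effects in an F2 intercross). Under
that assumption the pointwise P-value reduces to 10 raised to the power of minus the LOD
score. The following Python sketch, whose function name is hypothetical, shows the
calculation for one or two degrees of freedom.

```python
import math


def lod_to_pvalue(lod: float, df: int = 2) -> float:
    """Pointwise P-value for a LOD score via the chi-square approximation.

    The degrees of freedom are an assumption of this sketch, not a value
    taken from the source text.
    """
    chi2 = 2.0 * math.log(10.0) * lod  # likelihood ratio statistic
    if df == 2:
        return math.exp(-chi2 / 2.0)             # equals 10 ** (-lod)
    if df == 1:
        return math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 df
    raise ValueError("this sketch handles only df=1 or df=2")
```

This is a pointwise value for a single locus and does not account for testing across the
genome or across the many expression traits analyzed.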

Figure 87 highlights a subset of the genes whose expression in the liver of the
BXD animals is controlled by the chromosome 13 cQTL given in Figure 86 (i.e., these
genes have eQTL at either the 85 cM cQTL or 110 cM cQTL given in Figure 86). These
genes were identified in one of two ways: (1) the gene had an eQTL at or near the 85 cM
and 110 cM locations and the eQTL was cis-acting (C330026N13Rik or 1810058I14Rik)
or (2) the gene had an eQTL at either 85 cM or 110 cM of mouse chromosome 13, but the
gene was physically located on a chromosome other than 13, and was a druggable target
(CCKar, Foxc2, LepR, DPP IV, LIFE, CPT1a, CB2R, Orexin, CTE1, and RXRg). Here,
the definition of a druggable target set forth in Hopkins and Groom, 2002, Nature
Reviews 1, 727, was used. By cis-acting, it is meant that the gene physically resides at
the eQTL location for the gene. Those genes that are selected under the second criterion
were tested for causality with respect to any of the four clinical traits depicted in Figure
86 using the techniques disclosed in Section 5.19.4. Of these genes, four tested causal
(CCKAR, FOXC2, DPP4, LEPR). These four genes were subsequently mapped to
orthologous genes in the human genome using standard techniques to determine if any of
these genes were coincident with linkages for obesity-related traits in an Icelandic
population for which phenotypic information was available for several generations.




The above referenced Icelandic population consisted of large extended families in
the Icelandic population. The body mass indices (BMI) in these families were assessed,
and linkage analysis was carried out by considering obesity and thinness as qualitative
traits (defined as BMI > 34 for obesity and BMI < 21 for thinness). Gender based
differences were considered and observed to contribute to lod scores for obesity overall.
The obesity and thinness traits were used in a standard linkage analysis in order to
identify genetic loci controlling for each trait. When obesity for females only was
considered, a significant locus was identified on human chromosome 4. Only 1 of the 4
genes, cholecystokinin type A receptor (CCKAR; Fig. 88, SEQ ID NO: 30), Ulrich et al.,
1993, Biochem. Biophys. Res. Commun. 193, 204-211, fell within a one lod-score drop of
this obesity locus. Here, the term one lod-score drop means that portion of a locus that is
within one lod score unit of the maximum lod score value of the locus. For example, if a
locus has a maximum lod score of 8, that portion of the locus centered around this
maximum value that has a lod score value in excess of 7 is considered to be within a one
lod-score drop. Figure 89 highlights a lod score curve on human chromosome 4 for
percent body fat in females in the Icelandic population. As shown in Figure 89, contained
within a 1 lod score drop of the peak of this lod score are 29 genes, including the gene
CCKAR.
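
The one lod-score drop defined above can be computed from a sampled lod score curve as
sketched below. The sketch assumes the curve is given as parallel lists of map positions and
lod scores; the function name, the data layout, and any values supplied to it are
illustrative assumptions and are not taken from Figure 89.

```python
from typing import Sequence, Tuple


def one_lod_drop_interval(positions: Sequence[float], lods: Sequence[float],
                          drop: float = 1.0) -> Tuple[float, float]:
    """Return the contiguous interval around the peak whose lod score stays
    strictly above (peak lod score - drop)."""
    peak = max(range(len(lods)), key=lods.__getitem__)
    threshold = lods[peak] - drop
    left = peak
    while left > 0 and lods[left - 1] > threshold:
        left -= 1
    right = peak
    while right < len(lods) - 1 and lods[right + 1] > threshold:
        right += 1
    return positions[left], positions[right]


def genes_in_interval(gene_positions: dict, interval: Tuple[float, float]) -> list:
    """Names of genes whose map position falls inside the interval (inclusive)."""
    low, high = interval
    return sorted(name for name, pos in gene_positions.items() if low <= pos <= high)
```

Only the region contiguous with the peak is returned, so a second, separate region that also
exceeds the threshold is not counted as part of the same one lod-score drop.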

As a result of the intersection between CCKAR and the chromosome 4 locus from
females in the Icelandic population, the gene was resequenced in 752 subjects in the
Icelandic population. The 752 subjects were made up of 282 females with percent body
fat in the upper 15th percentile in the Icelandic population, 208 females with BMIs over
40 (morbidly obese), and then 282 females older than 40 years of age and with BMIs less
than 20 (considered thin). The CCKAR gene was resequenced in these individuals to
identify single nucleotide polymorphisms (SNPs) that could be used in an association test
to determine if any given SNP or constellation of SNPs were associated with obesity.

Figure 89 highlights that there were 55 common SNPs (minor allele frequency
greater than two percent) found in the CCKAR gene. Haplotypes for these SNPs were
constructed using standard techniques, and then each haplotype was tested for association
to percent body fat and to thinness phenotypes. Figures 90 and 91 highlight the results of
the association tests.

Table 9002 of Figure 90 highlights statistics for two related haplotypes that are
strongly associated with percent body fat in females. Each row of Table 9002 represents
the statistics for one of the two respective haplotypes. For the first haplotype, the
statistics are based on sampling a total of 281 afflicted (N_aff; obese) and 282 controls
(N_ctrl). For the second haplotype, the statistics are based on a sampling of a total of 281
afflicted and 279 controls. As can be seen by the statistics in Table 9002, the frequency
of these haplotypes in the obese population (N_aff) compared to the frequency of these
same haplotypes in the thin population (control; N_ctrl) is significantly different: three
percent in the thin population (Ctrl_freq) and eleven percent in the obese population
(Aff_frq). After correcting for multiple testing, the p-value (P_cor) for this association is
0.002, which is considered a very significant association in this setting.
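
One way a difference in haplotype frequency between afflicted and control groups could be
tested is sketched below. The text does not identify the test or the multiple testing
correction that produced P_cor, so the Pearson chi-square test on a 2x2 table and the
Bonferroni correction used here are assumptions, as are the function names. The counts in
the usage lines are hypothetical carrier counts chosen only to resemble frequencies of about
eleven percent and three percent; they are not data from Table 9002.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic (1 degree of freedom) for [[a, b], [c, d]].

    Rows are groups (afflicted, control); columns are haplotype present, absent.
    All four marginal totals must be nonzero.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denominator


def chi_square_pvalue_1df(statistic: float) -> float:
    """Upper tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


def bonferroni(p_value: float, n_tests: int) -> float:
    """Bonferroni-corrected P-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical counts: 31 of 281 afflicted and 8 of 282 controls carry the haplotype.
statistic = chi_square_2x2(31, 250, 8, 274)
p_uncorrected = chi_square_pvalue_1df(statistic)
p_corrected = bonferroni(p_uncorrected, n_tests=10)  # number of tests is assumed
```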

A different set of haplotypes was also identified in the CCKAR gene that was
significantly associated with thinness (see Figure 91). Each row of Table 9102 represents
the statistics for one of the two respective haplotypes. For the first haplotype, the
statistics are based on sampling a total of 282 afflicted (N_aff; thin) and 421 controls
(N_ctrl). For the second haplotype, the statistics are based on a sampling of a total of 282
afflicted and 421 controls. In combination, the two haplotypes in the thin (control)
population for which statistics are provided in Table 9102 of Figure 91 were seen in
seventeen percent of the subjects, while these same two haplotypes were only seen in four
percent of the subjects in the obese population (p-value corrected for multiple testing
equal to 0.02).

This section provides an example of the second cross-species approach outlined in
Section 5.19. In this example, a locus for obesity was identified on human chromosome
4. Further, the ortholog of a gene that falls within this locus, CCKAR, was found to be
causal for a trait in mice that corresponds to human obesity. This cross species
information makes CCKAR a top target, and the results of this testing are significant in
that DNA variations in the CCKAR gene in a cohort of obese and thin individuals were
seen to significantly associate with obesity and thinness.



6.9. CONCLUDING REMARKS 

Quantitative trait analyses on gene expression data have been described for the
most comprehensive analysis of the genetics of gene expression described to date in
mouse and humans. The present invention delineates the type of information that may be
obtained by intersecting two important sources of biological information, gene expression
data from microarray experiments and DNA variation data in segregating populations.
The identification of eQTL for genes expressed in a representative tissue provides
insight into the genetic networks that constitute the complexity of living systems. The
genes with very significant eQTL in the mouse or significant heritabilities in human
provide a list of interesting targets for many complex phenotypes, given the strong degree
of genetic control observed in what can be considered naturally segregating populations.
The several hundred genes with lod scores exceeding 20 in mouse that are described
above represent a new class of quantitative traits, with linkage significance not commonly
seen before in mammalian systems.

As detailed herein, the potential to provide clustering information in the genetics
dimension to help elucidate gene function in complex systems is a powerful tool. The
causal nature of genetics allows for the anchoring of multiple genes under the common
control of a single or multiple loci, as shown in Fig. 20, thereby providing roots for the
graphs that can more completely depict the complicated network of gene interactions at
play in complex phenotypes. Genes that are highly correlated with respect to expression
will usually have linkage regions in common. Genes that are not highly correlated, but
under the control of a common locus, will be appropriately clustered together in a linkage
analysis, thereby allowing identification of gene interactions in a novel way.

The class of genes discussed in relation to Fig. 59 and Fig. 20 above provides
objective evidence that many of the genes co-localized to a single QTL hot spot are
associated with the obesity-related traits. The patterns of expression serve to refine the
obesity phenotype and allow for the enrichment of subpopulations that are homogeneous
with respect to the underlying causes of obesity in the population. Identifying such
subpopulations has significant consequences for drug discovery, since each subpopulation
may be more effectively treated by a compound that targets a pathway specifically
associated with the disease in that subpopulation. An optimal strategy for the treatment
of a given common disease may be a panel of drugs targeting more homogeneous
subpopulations that have been objectively identified using the combination of gene
expression, genetics, and clinical data.

Several candidate genes for the chromosome 2 FPM QTL (Section 5.19) whose
physical locations are coincident with their respective eQTL are reasonable candidate
genes for further research. When these methods of the present invention are viewed from
the standpoint of hypothesis generation, the candidate genes with supporting genetic
clusters offer researchers possible insights into the complex traits and suggest meaningful
hypotheses for further validation. In this example, the combined gene
expression/genetics approach has effectively generated interesting hypotheses by filtering
the number of genes that would otherwise need to be considered from 25,000 to 3 or 4
reasonable candidates, with hundreds of additional genes forming patterns that represent
the reactive changes induced by the causative gene set, all of which have been identified
in a completely objective manner.

For the last century genetics has been used to identify regions in the genome
"causing" variation in a given trait. For the past decade gene expression has been used to
identify those genes that are co-regulated over some number of conditions, presenting
patterns of expression that help elucidate those genes involved in complex traits. The two
combined approaches have the power to refine the definition of complex phenotypes,
identify subtypes within a given phenotype, and uncover pathways associated with the
phenotype in an unprecedented manner. The potential exists to impact the more
significant rate-limiting steps in the drug discovery process: objectively classifying
individuals according to disease subtypes and identifying the drivers of the pathways, the
causal factors, underlying those disease subtypes. In the past, dissecting complex traits
using genetics has met with limited success, and up to now, gene expression has appeared
as an indirect marker for complex traits, so that others have settled for functional
uncertainty by restricting attention to the use of DNA markers in identifying the causal
factors for complex traits. We have demonstrated that the combination of gene
expression and genetics data has the potential to overcome these barriers. The addition of
gene expression data can be used to refine the disease phenotype, directly implicate
pathways and genes comprising those pathways associated with the disease phenotype,
and identify the key drivers of the pathways underlying the disease phenotype. Key
pathway drivers can potentially be identified even in cases where these drivers are not
expressed in the tissues profiled, since such key genes may be expressed in one tissue, yet
drive patterns of expression in different tissues. In such cases, transcript abundances of
those genes comprising the expression patterns in the profiled tissues will be genetically
linked to the physical location of the gene driving their expression.

The large-scale consideration of molecular phenotypes as quantitative traits is not
tied to microarray data, but can be applied to any form of molecular phenotyping data,
where the main aim is to consider large classes of phenotypes simultaneously in the
context of genetics to elucidate genes and pathways for complex diseases. Problems of
multiple testing, pattern recognition, more advanced multivariate statistical genetics
techniques and more efficient data-integration schemes will need to be vigorously
pursued to derive the most from this type of data. However, for now, the identification of
genes under genetic control, along with the loci that exhibit that control, provides a first
step in reaching the ultimate goal of piecing together genetic networks for use in
dissecting the etiology of complex traits.

7. REFERENCES CITED 
All references cited herein are incorporated herein by reference in their entirety
and for all purposes to the same extent as if each individual publication or patent or patent
application was specifically and individually indicated to be incorporated by reference in
its entirety for all purposes.

The present invention can be implemented as a computer program product that
comprises a computer program mechanism embedded in a computer readable storage
medium. For instance, the computer program product could contain the program modules
shown in Fig. 1. These program modules may be stored on a CD-ROM, magnetic disk
storage product, or any other computer readable data or program storage product. The
software modules in the computer program product may also be distributed electronically,
via the Internet or otherwise, by transmission of a computer data signal (in which the
software modules are embedded) on a carrier wave.

Many modifications and variations of this invention can be made without
departing from its spirit and scope, as will be apparent to those skilled in the art. The
specific embodiments described herein are offered by way of example only, and the
invention is to be limited only by the terms of the appended claims, along with the full
scope of equivalents to which such claims are entitled.



WHAT IS CLAIMED IS:

1. A method for associating a gene G in the genome of a first species with a
clinical trait T exhibited by said first species and a second species, the method
comprising:

(a) identifying an expression quantitative trait loci (eQTL) for a gene G' in said
second species that is an ortholog of said gene G using a first quantitative trait loci (QTL)
analysis, wherein said first QTL analysis uses a plurality of expression statistics for said
gene G' as a quantitative trait, wherein each expression statistic in said plurality of
expression statistics represents an expression value for said gene G' in an organism in a
plurality of organisms of said second species;

(b) identifying a clinical quantitative trait loci (cQTL) that is linked to said clinical
trait T using a second QTL analysis, wherein said second QTL analysis uses a plurality of
phenotypic values as a quantitative trait, wherein each phenotypic value in said plurality
of phenotypic values represents a phenotypic value for said clinical trait T in an organism
in said plurality of organisms of said second species; and

(c) determining whether said eQTL and said cQTL colocalize to the same locus in
the genome of said second species, wherein, when said eQTL and said cQTL colocalize
to the same locus, said gene G is associated with said clinical trait T in said first species.

2. The method of claim 1, wherein said determining step (c) further
comprises determining whether the locus of said eQTL in the genome of said second
species corresponds to the physical location of said gene G' in the genome of said second
species, wherein, when said locus of said eQTL in the genome of said second species
corresponds to the physical location of said gene G' in the genome of said second species
said gene G is associated with said clinical trait T.

3. The method of claim 2, wherein said eQTL corresponds to the physical
location of said gene G' when the eQTL and said gene G' colocalize within 3 cM of each
other in the genome of said second species.

4. The method of claim 2, wherein said eQTL corresponds to the
location of said gene G' when the eQTL and said gene G' colocalize within 1 cM of each
other in the genome of said second species.
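
The distance test recited in claims 3 and 4 can be sketched as a simple comparison of map positions. The sketch below is illustrative only: the `MapPosition` type, the function name, and the treatment of "within" as an inclusive bound are assumptions, and it presumes the gene's physical location has already been translated to a centiMorgan position on the same genetic map as the eQTL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapPosition:
    """Hypothetical genetic-map position: chromosome label plus offset in cM."""
    chromosome: str
    cm: float


def eqtl_matches_gene_location(eqtl: MapPosition, gene: MapPosition,
                               max_cm: float = 3.0) -> bool:
    """Return True when the eQTL and gene G' lie within max_cm of each other.

    Positions on different chromosomes never match. The inclusive bound
    (<=) is an assumption; the claims say only "within".
    """
    if eqtl.chromosome != gene.chromosome:
        return False
    return abs(eqtl.cm - gene.cm) <= max_cm
```

Calling the function with `max_cm=3.0` corresponds to the window of claim 3 and `max_cm=1.0` to the narrower window of claim 4.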




5. The method of claim 1, the method further comprising testing whether a
colocalization of said eQTL and said cQTL is caused by pleiotropy.

6. The method of claim 1, wherein said first QTL analysis and said second
QTL analysis each uses a genetic marker map that represents the genome of said second
species.

7. The method of claim 6, which further comprises, prior to said identifying
step (a), a step of constructing said genetic marker map from a set of genetic markers
associated with a plurality of organisms representing said second species.

8. The method of claim 7, wherein said set of genetic markers comprises
single nucleotide polymorphisms (SNPs), microsatellite markers, restriction fragment
length polymorphisms, short tandem repeats, DNA methylation markers, sequence length
polymorphisms, random amplified polymorphic DNA, amplified fragment length
polymorphisms, or simple sequence repeats.

9. The method of claim 7, wherein genotype data is used in said constructing
step and wherein said genotype data comprises knowledge of which alleles, for each
marker in said set of genetic markers, are present in each organism in said plurality of
organisms representing said second species.

10. The method of claim 7, wherein said plurality of organisms representing
said second species represents a segregating population and pedigree data is used in said
constructing step, and wherein said pedigree data shows one or more relationships
between organisms in said plurality of organisms representing said second species.

11. The method of claim 10, wherein said plurality of organisms representing
said second species comprises an F2 population, a F1 population, a F2:3 population, or a
Design III population and said one or more relationships between organisms in said
plurality of organisms representing said second species indicates which organisms in said
plurality of organisms representing said second species are members of said F2
population, said F1 population, said F2:3 population, or said Design III population.




12. The method of claim 1, wherein each said expression value is a normalized
expression level measurement for said gene G' in an organism in said plurality of
organisms of said second species.

13. The method of claim 12, wherein each said expression level measurement
is determined by measuring an amount of a cellular constituent encoded by said gene G'
in one or more cells from an organism in said plurality of organisms of said second
species.

14. The method of claim 13, wherein said amount of said cellular constituent
comprises an abundance of an RNA present in said one or more cells of said organism.

15. The method of claim 14, wherein said abundance of said RNA is measured
by a method comprising contacting a gene transcript array with said RNA from said one
or more cells of said organism, or with nucleic acid derived from said RNA, wherein said
gene transcript array comprises a positionally addressable surface with attached nucleic
acids or nucleic acid mimics, wherein said nucleic acids or nucleic acid mimics are
capable of hybridizing with said RNA species, or with nucleic acid derived from said
RNA species.

16. The method of claim 12, wherein said normalized expression level
measurement is obtained by a normalization technique selected from the group consisting
of Z-score of intensity, median intensity, log median intensity, Z-score standard deviation
log of intensity, Z-score mean absolute deviation of log intensity, calibration DNA gene
set, user normalization gene set, ratio median intensity correction, and intensity
background correction.
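
Three of the normalization techniques named in claim 16 can be illustrated with short functions. The claim names the techniques but does not give formulas, so the definitions below are common interpretations offered as assumptions, and the function names are hypothetical. Intensities are taken to be positive raw measurements from one array.

```python
import math
import statistics
from typing import List, Sequence


def zscore_of_intensity(intensities: Sequence[float]) -> List[float]:
    """Centre on the mean and scale by the sample standard deviation."""
    mean = statistics.fmean(intensities)
    sd = statistics.stdev(intensities)  # sample SD; needs at least two values
    return [(x - mean) / sd for x in intensities]


def median_intensity_normalize(intensities: Sequence[float]) -> List[float]:
    """Divide every intensity by the median intensity of the array."""
    med = statistics.median(intensities)
    return [x / med for x in intensities]


def log_median_intensity(intensities: Sequence[float]) -> List[float]:
    """Base-10 log of each median-normalized intensity (assumed definition)."""
    return [math.log10(v) for v in median_intensity_normalize(intensities)]
```

Any of these outputs could serve as the "normalized expression level measurement" that feeds the QTL analysis; which one is appropriate depends on the array platform and is not specified here.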

17. The method of claim 1, wherein said first QTL analysis comprises:

(i) testing for linkage between (a) the genotype of said plurality of organisms of
said second species at a position in the genome of said second species and (b) said
plurality of expression statistics for said gene G';

(ii) advancing the position in said genome by an amount; and

(iii) repeating steps (i) and (ii) until all or a portion of the genome of said second
species has been tested.
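
Steps (i) through (iii) describe a stepwise scan along the genetic map. A minimal sketch of that loop follows, assuming a caller-supplied scoring function; `genome_scan`, `test_at`, and the dictionary of chromosome lengths are hypothetical names introduced for illustration, and the statistical test itself is deliberately left abstract.

```python
from typing import Callable, Dict, List, Tuple


def genome_scan(chromosome_lengths_cm: Dict[str, float],
                test_at: Callable[[str, float], float],
                step_cm: float = 2.0) -> List[Tuple[str, float, float]]:
    """Score every position reached by walking each chromosome in fixed steps.

    test_at(chromosome, position_cm) is a placeholder for the linkage test of
    step (i); step_cm is the "amount" of step (ii).
    """
    if step_cm <= 0:
        raise ValueError("step_cm must be positive")
    results = []
    for chromosome, length in chromosome_lengths_cm.items():
        n_steps = int(length // step_cm)
        for i in range(n_steps + 1):          # step (iii): repeat until done
            position = i * step_cm            # step (ii): advance the position
            score = test_at(chromosome, position)  # step (i): test for linkage
            results.append((chromosome, position, score))
    return results
```

Positions are computed as multiples of the step rather than by repeated addition, which avoids accumulating floating-point drift on long chromosomes.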






18. The method of claim 17, wherein said amount is less than 100
centiMorgans.

19. The method of claim 17, wherein said amount is less than 10
centiMorgans.

20. The method of claim 17, wherein said amount is less than 5 centiMorgans.

21. The method of claim 17, wherein said amount is less than 2.5
centiMorgans.

22. The method of claim 17, wherein said testing comprises performing
linkage analysis or association analysis.

23. The method of claim 22, wherein said linkage analysis or association
analysis generates a statistical score for said position in the genome of said second
species.

24. The method of claim 23, wherein said testing is linkage analysis and said
statistical score is a logarithm of the odds (lod) score.
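
One simple way to obtain a lod score at a genotyped marker is single-marker regression under a normal model, where lod = (n/2) log10(RSS_null / RSS_genotype). The sketch below uses that formula purely as an illustration; it is an assumption that this matches the linkage analysis contemplated by the claim, and the function name is hypothetical.

```python
import math
from collections import defaultdict
from typing import Hashable, Sequence


def lod_marker_regression(genotypes: Sequence[Hashable],
                          trait_values: Sequence[float]) -> float:
    """lod score comparing a per-genotype-mean model with a single-mean model."""
    if len(genotypes) != len(trait_values) or not trait_values:
        raise ValueError("genotypes and trait_values must be equal-length and non-empty")
    n = len(trait_values)
    grand_mean = sum(trait_values) / n
    # Residual sum of squares with no genetic effect (one overall mean).
    rss_null = sum((y - grand_mean) ** 2 for y in trait_values)

    groups = defaultdict(list)
    for g, y in zip(genotypes, trait_values):
        groups[g].append(y)
    # Residual sum of squares when each genotype class has its own mean.
    rss_alt = 0.0
    for ys in groups.values():
        group_mean = sum(ys) / len(ys)
        rss_alt += sum((y - group_mean) ** 2 for y in ys)

    if rss_null == 0.0:
        return 0.0          # trait does not vary, so there is nothing to explain
    if rss_alt == 0.0:
        return math.inf     # genotype explains the trait perfectly
    return (n / 2.0) * math.log10(rss_null / rss_alt)
```

The returned value can be compared against a threshold such as 2.0, 3.0, 4.0, or 5.0 to decide whether a position is reported as a QTL.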

25. The method of claim 24, wherein said eQTL is represented by a lod score
that is greater than 2.0.

26. The method of claim 24, wherein said eQTL is represented by a lod score
that is greater than 3.0.

27. The method of claim 24, wherein said eQTL is represented by a lod score
that is greater than 4.0.

28. The method of claim 24, wherein said eQTL is represented by a lod score
that is greater than 5.0.




29. The method of claim 1, wherein said second QTL analysis comprises:

(i) testing for linkage between (a) the genotype of said plurality of organisms of
said second species at a position in the genome of said second species and (b) said
plurality of phenotypic values;

(ii) advancing the position in said genome by an amount; and

(iii) repeating steps (i) and (ii) until all or a portion of the genome of said second
species has been tested.

30. The method of claim 29, wherein said amount is less than 100
centiMorgans.

31. The method of claim 29, wherein said amount is less than 10
centiMorgans.

32. The method of claim 29, wherein said amount is less than 5 centiMorgans.

33. The method of claim 29, wherein said amount is less than 2.5
centiMorgans.

34. The method of claim 29, wherein said testing comprises performing
linkage analysis or association analysis.

35. The method of claim 34, wherein said linkage analysis or association
analysis generates a statistical score for said position in the genome of said second
species.

36. The method of claim 35, wherein said testing is linkage analysis and said
statistical score is a logarithm of the odds (lod) score.

37. The method of claim 36, wherein said cQTL is represented by a lod score
that is greater than 2.0.

38. The method of claim 36, wherein said cQTL is represented by a lod score
that is greater than 3.0.

39. The method of claim 36, wherein said cQTL is represented by a lod score
that is greater than 4.0.

40. The method of claim 36, wherein said cQTL is represented by a lod score
that is greater than 5.0.



41. The method of claim 1, wherein said first species is human.

42. The method of claim 1, wherein said second species is a plant or an
animal.

43. The method of claim 1, wherein said second species is corn, beans, rice,
tobacco, potatoes, tomatoes, cucumbers, apple trees, orange trees, cabbage, lettuce, or
wheat.

44. The method of claim 1, wherein said second species is a mammal, a
primate, mice, rats, dogs, cats, chickens, horses, cows, pigs, or monkeys.

45. The method of claim 1, wherein said second species is Drosophila, yeast, a
virus, or Caenorhabditis elegans.

46. The method of claim 1, wherein said clinical trait T is a complex trait.

47. The method of claim 46, wherein said complex trait T is characterized by
an allele that exhibits incomplete penetrance in said second species.

48. The method of claim 46, wherein said complex trait is a disease that is
contracted by an organism in said plurality of organisms of said second species, and
wherein said organism inherits no predisposing allele to said disease.

49. The method of claim 46, wherein said complex trait arises when any of a
plurality of different genes in the genome of said second species is somatically mutated.




50. The method of claim 46, wherein said complex trait requires the
simultaneous presence of mutations in a plurality of genes in the genome of said second
species.

51. The method of claim 46, wherein said complex trait is associated with a
high frequency of disease-causing alleles in said second species.

52. The method of claim 46, wherein said complex trait is a phenotype that
does not exhibit Mendelian recessive or dominant inheritance attributable to a single gene
locus.

53. The method of claim 46, wherein said complex trait is asthma, ataxia
telangiectasia, bipolar disorder, cancer, common late-onset Alzheimer's disease, diabetes,
heart disease, hereditary early-onset Alzheimer's disease, hereditary nonpolyposis colon
cancer, hypertension, infection, maturity-onset diabetes of the young, mellitus, migraine,
nonalcoholic fatty liver, nonalcoholic steatohepatitis, non-insulin-dependent diabetes
mellitus, obesity, polycystic kidney disease, psoriases, schizophrenia, or xeroderma
pigmentosum.

54. The method of claim 1, wherein said eQTL and said cQTL colocalize to
the same locus in the genome of said second species when the physical location of the
eQTL in said genome is within 40 cM of the physical location of the cQTL in said
genome.

55. The method of claim 1, wherein said eQTL and said cQTL colocalize to
the same locus in the genome of said second species when the physical location of the
eQTL in said genome is within 20 cM of the physical location of the cQTL in said
genome.

56. The method of claim 1, wherein said eQTL and said cQTL colocalize to
the same locus in the genome of said second species when the physical location of the
eQTL in said genome is within 10 cM of the physical location of the cQTL in said
genome.

57. The method of claim 1, wherein said eQTL and said cQTL colocalize to
the same locus in the genome of said second species when the physical location of the
eQTL in said genome is within 6 cM of the physical location of the cQTL in said genome.
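
The colocalization windows recited in claims 54 through 57 (40, 20, 10, and 6 cM) can be applied to lists of detected QTL peaks as shown below. This is a sketch under stated assumptions: the `Qtl` record, the function name, and the inclusive comparison are all hypothetical, and both sets of peaks are assumed to be expressed on the same genetic map.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Qtl(NamedTuple):
    """Hypothetical QTL peak: label, chromosome, and map position in cM."""
    name: str
    chromosome: str
    cm: float


def colocalizing_pairs(eqtls: Sequence[Qtl], cqtls: Sequence[Qtl],
                       max_cm: float = 10.0) -> List[Tuple[str, str]]:
    """Return (eQTL name, cQTL name) pairs that fall within max_cm of each other."""
    pairs = []
    for e in eqtls:
        for c in cqtls:
            same_chromosome = e.chromosome == c.chromosome
            if same_chromosome and abs(e.cm - c.cm) <= max_cm:
                pairs.append((e.name, c.name))
    return pairs
```

Narrowing `max_cm` from 40 to 6 makes the association between gene G and clinical trait T progressively more stringent, at the cost of missing QTL whose peak positions are imprecisely estimated.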

58. A computer program product for use in conjunction with a computer
system, the computer program product comprising a computer readable storage medium
and a computer program mechanism embedded therein, the computer program
mechanism for associating a gene G in the genome of a first species with a clinical trait T
exhibited by said first species and a second species, the computer program mechanism
comprising:

an expression quantitative trait loci (eQTL) identification module for identifying
an expression quantitative trait loci (eQTL) for a gene G' in said second species that is an
ortholog of said gene G using a first quantitative trait loci (QTL) analysis, wherein said
first QTL analysis uses a plurality of expression statistics for said gene G' as a
quantitative trait, wherein each expression statistic in said plurality of expression statistics
represents an expression value for said gene G' in an organism in said plurality of
organisms of said second species;

a clinical quantitative trait loci (cQTL) identification module for identifying a
clinical quantitative trait loci (cQTL) that is linked to said clinical trait T using a second
QTL analysis, wherein said second QTL analysis uses a plurality of phenotypic values as
a quantitative trait, wherein each phenotypic value in said plurality of phenotypic values
represents a phenotypic value for said clinical trait T in an organism in said plurality of
organisms of said second species; and

a determination module for determining whether said eQTL and said cQTL
colocalize to the same locus in the genome of said second species, wherein, when said
eQTL and said cQTL colocalize to the same locus, said gene G is associated with said
clinical trait T in said first species.

59. A computer system for associating a gene G in the genome of a first
species with a clinical trait T exhibited by said first species and a second species, the
computer system comprising:

a central processing unit;

a memory, coupled to the central processing unit, the memory storing an
expression quantitative trait loci (eQTL) identification module, a clinical quantitative trait
loci (cQTL) identification module, and a determination module; wherein

the expression quantitative trait loci (eQTL) identification module comprises
instructions for identifying an expression quantitative trait loci (eQTL) for a gene G' in
said second species that is an ortholog of said gene G using a first quantitative trait loci
(QTL) analysis, wherein said first QTL analysis uses a plurality of expression statistics
for gene G' as a quantitative trait, wherein each expression statistic in said plurality of
expression statistics represents an expression value for said gene G' in an organism in a
plurality of organisms of said second species;

the clinical quantitative trait loci (cQTL) identification module comprises
instructions for identifying a clinical quantitative trait loci (cQTL) that is linked to said
clinical trait T using a second QTL analysis, wherein said second QTL analysis uses a
plurality of phenotypic values as a quantitative trait, wherein each phenotypic value in
said plurality of phenotypic values represents a phenotypic value for said clinical trait T
in an organism in said plurality of organisms of said second species; and

the determination module comprises instructions for determining whether said
eQTL and said cQTL colocalize to the same locus in the genome of said second species,
wherein, when said eQTL and said cQTL colocalize to the same locus, said gene G is
associated with said clinical trait T.

60. A method for associating a gene G in the genome of a first species with a
clinical trait T exhibited by said first species and a second species, the method
comprising:

(a) clustering quantitative trait locus data from a plurality of quantitative trait
locus analyses to form a quantitative trait locus interaction map, wherein

each quantitative trait locus analysis in said plurality of quantitative trait locus
analyses is performed for a gene in a plurality of genes in the genome of said second
species using a genetic marker map and a quantitative trait in order to produce said
quantitative trait locus data, wherein, for each quantitative trait locus analysis, said
quantitative trait comprises an expression statistic for the gene for which the quantitative
trait locus analysis is performed, for each organism in a plurality of organisms of said
second species; and wherein

said genetic marker map is constructed from a set of genetic markers associated
with said plurality of organisms of said second species; and

(b) analyzing said quantitative trait locus interaction map to identify a gene G' of
said second species that is associated with said clinical trait T, wherein said gene G' is an
ortholog of said gene G in said first species, thereby associating a gene G in the genome
of said first species with said clinical trait T.

61. The method of claim 60, which further comprises, prior to said clustering
step, a step of performing each said quantitative trait locus analysis in said plurality of
quantitative trait locus analyses.

62. The method of claim 60, wherein said expression statistic for said gene G'
is computed by a method comprising transforming an expression level measurement of
said gene G' from each organism in said plurality of organisms of said second species.

63. The method of claim 60, wherein each said quantitative trait locus analysis
comprises:

(i) testing for linkage between a position in a chromosome, in the genome of said
second species, and the quantitative trait used in the quantitative trait locus analysis;

(ii) advancing the position in said chromosome by an amount; and

(iii) repeating steps (i) and (ii) until all or a portion of the genome has been tested.

64. The method of claim 63, wherein said quantitative trait locus data
produced from each respective quantitative trait locus analysis comprises a logarithm of
the odds score computed at each said position.

65. The method of claim 63, wherein said testing comprises performing
linkage analysis or association analysis.



66. The method of claim 60, wherein said clustering of the quantitative trait
locus data from each said quantitative trait locus analysis comprises applying a
hierarchical clustering technique, applying a k-means technique, applying a fuzzy k-means
technique, applying a Jarvis-Patrick clustering, applying a self-organizing map
technique, or applying a neural network technique.




67. The method of claim 62, which further comprises constructing a gene
expression cluster map from each expression statistic created by said transforming step.

68. The method of claim 67, wherein said constructing a gene expression
cluster map comprises:

creating a plurality of gene expression vectors, each gene expression vector in said
plurality of gene expression vectors representing an expression level measurement of a
gene, in said plurality of genes in the genome of said second species, in each of the
plurality of organisms of said second species;

computing a plurality of correlation coefficients, wherein each correlation
coefficient in said plurality of correlation coefficients is computed between a gene
expression vector pair in said plurality of gene expression vectors; and

clustering said plurality of gene expression vectors based on said plurality of
correlation coefficients in order to form said gene expression cluster map.
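
The three steps of claim 68 (build vectors, compute pairwise correlation coefficients, cluster on those coefficients) can be illustrated with a small pure-Python sketch. The choice of Pearson correlation, the fixed threshold `min_r`, and the single-linkage grouping via union-find are assumptions made for the example; the claim does not fix any of them, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def cluster_by_correlation(vectors: Dict[str, Sequence[float]],
                           min_r: float = 0.9) -> List[List[str]]:
    """Group genes connected by a chain of pairwise correlations >= min_r."""
    genes = sorted(vectors)
    parent = {g: g for g in genes}

    def find(g: str) -> str:
        # Follow parent links to the cluster representative, compressing the path.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            if pearson(vectors[a], vectors[b]) >= min_r:
                parent[find(a)] = find(b)   # merge the two clusters

    clusters: Dict[str, List[str]] = {}
    for g in genes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(clusters.values())
```

Each vector holds one expression measurement per organism, so all vectors must list the organisms in the same order.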

69. The method of claim 68, wherein said step of analyzing said quantitative
trait locus interaction map comprises filtering the quantitative trait locus interaction map
in order to obtain a candidate pathway group; wherein the filtering comprises identifying
a quantitative trait locus in said candidate pathway group in said gene expression cluster
map.

70. The method of claim 67, wherein said constructing a gene expression
cluster map comprises:

creating a plurality of gene expression vectors, each gene expression vector in said
plurality of gene expression vectors representing a gene in said plurality of genes;

computing a plurality of metrics, wherein each metric in said plurality of metrics
is computed between a gene expression vector pair in said plurality of gene expression
vectors; and

clustering said plurality of gene expression vectors based on said plurality of
metrics in order to form said gene expression cluster map.

71. The method of claim 60, wherein said plurality of genes comprises at least
five genes.






72. A computer program product for use in conjunction with a computer
system, the computer program product comprising a computer readable storage medium
and a computer program mechanism embedded therein, the computer program
mechanism comprising:

a clustering module for clustering quantitative trait locus data from a plurality of
quantitative trait locus analyses to form a quantitative trait locus interaction map; wherein

each quantitative trait locus analysis in said plurality of quantitative trait locus
analyses is performed for a gene in a plurality of genes in the genome of a second species
using a genetic marker map and a quantitative trait in order to produce said quantitative
trait locus data, wherein, for each quantitative trait locus analysis, said quantitative trait
comprises an expression statistic for the gene for which the quantitative trait locus
analysis is performed, for each organism in a plurality of organisms of said second
species; and wherein

said genetic marker map is constructed from a set of genetic markers associated
with said plurality of organisms of said second species;

an analysis module for analyzing said quantitative trait locus interaction map to
identify a gene G' in said second species that is associated with a clinical trait T exhibited
by a first species and said second species, wherein said gene G' is an ortholog of a gene
G in said first species.

73. A computer system for associating a gene G in the genome of a first
species with a clinical trait T exhibited by said first species and a second species, the
computer system comprising:

a central processing unit;

a memory, coupled to the central processing unit, the memory storing a clustering
module, an analysis module and an ortholog identification module;

the clustering module for clustering quantitative trait locus data from a plurality of
quantitative trait locus analyses to form a quantitative trait locus interaction map; wherein

each quantitative trait locus analysis in said plurality of quantitative trait locus
analyses is performed for a gene in a plurality of genes in the genome of said second
species using a genetic marker map and a quantitative trait in order to produce said
quantitative trait locus data, wherein, for each quantitative trait locus analysis, said
quantitative trait comprises an expression statistic for the gene for which the quantitative
trait locus analysis is performed, for each organism in a plurality of organisms of said
second species; and wherein

said genetic marker map is constructed from a set of genetic markers associated
with said plurality of organisms of said second species;

the analysis module for analyzing said quantitative trait locus interaction map to
identify a gene G' of said second species that is associated with a clinical trait T exhibited
by said first species and said second species, wherein said gene G' is an ortholog of said
gene G of said first species.

74. A method for identifying a quantitative trait locus for a complex trait in a
first species, wherein the complex trait is exhibited by said first species and a second
species, the method comprising:

(a) dividing a plurality of organisms of said second species into a plurality of
subpopulations using a classification scheme that classifies each organism in said
plurality of organisms of said second species into at least one of said subpopulations,
wherein said classification scheme uses a plurality of cellular constituent measurements
from each said organism of said second species; and

(b) for at least one subpopulation in said plurality of subpopulations, performing
quantitative genetic analysis on said subpopulation in order to identify a quantitative trait
locus for said complex trait in said second species, wherein said quantitative trait locus
for said complex trait in said second species is an ortholog of the quantitative trait loci in
said first species, thereby identifying said quantitative trait locus for said complex trait in
said first species.

75. The method of claim 74, wherein said complex trait is a disease that is
contracted by an organism in said first species or said second species, and wherein said
organism inherits no predisposing allele to said disease.

76. The method of claim 74, wherein said complex trait arises when any of a
plurality of different genes in the genome of said first species or said second species is
mutated.

77. The method of claim 74, wherein said complex trait is associated with a
high frequency of disease-causing alleles in said first species or said second species.






78. The method of claim 74, wherein said complex trait is a phenotype that
does not exhibit Mendelian recessive or dominant inheritance attributable to a gene locus.



heart disease, hereditary early-onset Alzheimer's disease, hereditary nonpolyposis colon 
cancer, hypertension, infection, maturity-onset diabetes of the young, mellitus, migraine, 
nonalcoholic fatty liver, nonalcoholic steatohepatitis, non-insulin-dependent diabetes 
mellitus, obesity, polycystic kidney disease, psoriases, schizophrenia, or xeroderma
pigmentosum. 



80. The method of claim 74, wherein said plurality of cellular constituent
measurements from each said organism of said second species comprises the
measurement of the cellular constituent levels of ten or more cellular constituents in each
said organism.



81. The method of claim 74, wherein said dividing comprises determining
whether a class predictor is available, and

when a class predictor is available, using a supervised classification scheme to
classify each organism in said plurality of organisms of said second species into a
subpopulation in said plurality of subpopulations; and

when a class predictor is not available, using an unsupervised classification
scheme to classify each organism in said plurality of organisms of said second species
into a subpopulation in said plurality of subpopulations.

82. The method of claim 74, wherein said classification scheme is a supervised
classification scheme.

83. The method of claim 74, wherein said classification scheme is an
unsupervised classification scheme.

84. The method of claim 83, wherein said unsupervised classification scheme
is a hierarchical cluster analysis that uses a nearest-neighbor algorithm, a farthest-neighbor
algorithm, an average linkage algorithm, a centroid algorithm, or a sum-of-squares
algorithm to determine the similarity between (i) the plurality of cellular
constituent measurements from one organism in said plurality of organisms of said
second species and (ii) the plurality of cellular constituent measurements from another
organism in said plurality of organisms of said second species.
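
The hierarchical cluster analysis recited above can be sketched as a naive agglomerative procedure in which the linkage rule is selectable. Only the nearest-neighbor, farthest-neighbor, and average linkage rules are shown; the centroid and sum-of-squares rules are omitted for brevity. Euclidean distance, the stopping rule (a target number of subpopulations), and all names are assumptions introduced for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Distance between two organisms' cellular constituent profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Each linkage rule reduces all pairwise distances between two clusters to one value.
LINKAGES: Dict[str, Callable[[List[float]], float]] = {
    "nearest": min,                               # nearest-neighbor (single) linkage
    "farthest": max,                              # farthest-neighbor (complete) linkage
    "average": lambda ds: sum(ds) / len(ds),      # average linkage
}


def agglomerate(profiles: Dict[str, Sequence[float]], n_clusters: int,
                linkage: str = "average") -> List[List[str]]:
    """Merge the two closest clusters until n_clusters subpopulations remain."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    combine = LINKAGES[linkage]
    clusters = [[name] for name in sorted(profiles)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                distances = [euclidean(profiles[a], profiles[b])
                             for a in clusters[i] for b in clusters[j]]
                d = combine(distances)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return sorted(clusters)
```

Each resulting cluster would correspond to one subpopulation on which the quantitative genetic analysis is then run. The nested loops make this cubic in the number of organisms, which is acceptable for a sketch but not for large cohorts.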

85. A computer program product for use in conjunction with a computer
system, the computer program product comprising a computer readable storage medium
and a computer program mechanism embedded therein, the computer program
mechanism comprising:

a classification module for dividing a plurality of organisms of a second species
into a plurality of subpopulations using a classification scheme that classifies each
organism in said plurality of organisms of said second species into at least one of said
subpopulations, wherein said classification scheme uses a plurality of cellular constituent
measurements from each said organism in said second species;

a genetic analysis module that, for at least one subpopulation in said plurality of
subpopulations, performs quantitative genetic analysis on said subpopulation in order to
identify a quantitative trait locus for a complex trait that is exhibited by said second
species and a first species, wherein said quantitative trait locus for said complex trait that
is exhibited by said second species is the ortholog of the quantitative trait locus of said
first species.

86. A computer system for identifying a quantitative trait locus for a complex
trait in a first species, wherein the complex trait is exhibited by said first species and a
second species, the computer system comprising:

a central processing unit;

a memory, coupled to the central processing unit, the memory storing a
classification module, and a genetic analysis module; wherein

the classification module includes instructions for dividing a plurality of
organisms of a second species into a plurality of subpopulations using a classification
scheme that classifies each organism in said plurality of organisms of said second species
into at least one of said subpopulations, wherein said classification scheme uses a
plurality of cellular constituent measurements from each said organism in said second
species;

the genetic analysis module includes instructions that, for at least one
subpopulation in said plurality of subpopulations, performs quantitative genetic analysis
on said subpopulation in order to identify said quantitative trait locus in said second
species for said complex trait, wherein the quantitative trait locus in said second species is
the ortholog of the quantitative trait locus in said first species.



87. A method for determining whether a candidate molecule affects a body
weight disorder associated with an organism, comprising:

(a) contacting a cell from said organism with, or recombinantly expressing within
the cell from said organism, said candidate molecule;

(b) determining whether the RNA expression or protein expression in said cell of
at least one open reading frame is changed in step (a) relative to the expression of said
open reading frame in the absence of the candidate molecule, each said open reading frame
being regulated by a promoter native to a nucleic acid sequence selected from the group
consisting of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 9, SEQ ID
NO: 12, SEQ ID NO: 13, SEQ ID NO: 16, SEQ ID NO: 19, SEQ ID NO: 20, and
homologs of each of the foregoing; and

(c) determining that the candidate molecule affects a body weight disorder
associated with said organism when the RNA expression or protein expression of said at
least one open reading frame is changed, or

determining that the candidate molecule does not affect a body weight disorder
associated with said organism when the RNA expression or protein expression of said at
least one open reading frame is unchanged.

88. The method of claim 87 wherein a cell from said organism contacted with
the candidate molecule exhibits a lower expression level of a protein sequence selected
from the group consisting of SEQ ID NO: 4, SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO:
7, SEQ ID NO: 8, SEQ ID NO: 10, SEQ ID NO: 11, SEQ ID NO: 14, SEQ ID NO: 15,
SEQ ID NO: 17, SEQ ID NO: 18, SEQ ID NO: 21, SEQ ID NO: 22, SEQ ID NO: 23,
SEQ ID NO: 24, SEQ ID NO: 25, SEQ ID NO: 26, SEQ ID NO: 27, SEQ ID NO: 28, and
SEQ ID NO: 29 than a cell from said organism that is not contacted with said candidate
molecule.




89. The method of claim 87, wherein step (b) comprises determining whether
RNA expression is changed.

90. The method of claim 87, wherein step (b) comprises determining whether 
protein expression is changed.

91. The method of claim 87, wherein step (b) comprises determining whether
RNA or protein expression of at least two of said open reading frames is changed. 

92. The method of claim 87, wherein step (a) comprises contacting the cell

with the candidate molecule, and wherein step (a) is carried out in a liquid high 
throughput-like assay. 

93. The method of claim 87, wherein the cell comprises a promoter region of
at least one gene selected from the group consisting of SEQ ID NO: 1, SEQ ID NO: 2,
SEQ ID NO: 3, SEQ ID NO: 9, SEQ ID NO: 12, SEQ ID NO: 13, SEQ ID NO: 16, SEQ
ID NO: 19, SEQ ID NO: 20, and homologs of each of the foregoing, each promoter
region being operably linked to a marker gene; and wherein step (b) comprises
determining whether the RNA expression or protein expression of the marker gene(s) is
changed in step (a) relative to the expression of said marker gene in the absence of the
candidate molecule.


94. The method of claim 93, wherein the marker gene is selected from the
group consisting of green fluorescent protein, red fluorescent protein, blue fluorescent
protein, luciferase, LEU2, LYS2, ADE2, TRP1, CAN1, CYH2, GUS, CUP1 and
chloramphenicol acetyl transferase.

95. The method of claim 87, wherein said body weight disorder is obesity, 
anorexia nervosa, bulimia nervosa or cachexia. 

96. A method of identifying a molecule that specifically binds to a ligand
selected from the group consisting of (i) a protein encoded by a gene selected from the
group consisting of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 9, SEQ 
ID NO: 12, SEQ ID NO: 13, SEQ ID NO: 16, SEQ ID NO: 19, SEQ ID NO: 20, and 




homologs of each of the foregoing, and (ii) a biologically active fragment of SEQ ID NO: 
4, SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, SEQ ID NO: 8, SEQ ID NO: 10, SEQ
ID NO: 11, SEQ ID NO: 14, SEQ ID NO: 15, SEQ ID NO: 17, SEQ ID NO: 18, SEQ ID
NO: 21, SEQ ID NO: 22, SEQ ID NO: 23, SEQ ID NO: 24, SEQ ID NO: 25, SEQ ID
NO: 26, SEQ ID NO: 27, SEQ ID NO: 28, and SEQ ID NO: 29, the method comprising:

(a) contacting the ligand with one or more candidate molecules under conditions 
conducive to binding between the ligand and the candidate molecules; and 

(b) identifying a molecule within the one or more candidate molecules that binds 
to the ligand.

97. A purified protein comprising the amino acid sequence of SEQ ID NO: 8. 

98. A purified protein encoded by a nucleic acid hybridizable under conditions 
of high stringency to a DNA having a sequence consisting of the coding region of SEQ 

ID NO: 2.

99. A purified protein comprising an amino acid sequence that has at least 
90% identity to the amino acid sequence set forth in SEQ ID NO: 8, in which percentage 

identity is determined over an amino acid sequence of identical size as SEQ ID NO: 8.

100. A purified protein comprising an amino acid sequence that has at least
95% identity to the amino acid sequence set forth in SEQ ID NO: 8, in which percentage 
identity is determined over an amino acid sequence of identical size as SEQ ID NO: 8. 
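
Claims 99 and 100 recite percentage identity determined over a sequence of identical size. A minimal sketch of that calculation follows, assuming an ungapped, position-by-position comparison (the claims do not name an alignment method). The function name and the short peptide strings in the usage note are hypothetical and are not taken from SEQ ID NO: 8.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions with identical residues in two equal-length sequences.

    Assumption: no gaps are introduced; residues are compared position by position.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be of identical size")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)
```

For example, two ten-residue placeholder peptides that differ at one position, such as `"MKTAYIAKQR"` and `"MKTAYIAKQK"`, score 90.0, which would meet a "at least 90%" threshold but not a "at least 95%" threshold.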

101. An isolated nucleic acid comprising the nucleotide sequence of SEQ ID

NO: 2, a coding region of SEQ ID NO: 2, SEQ ID NO: 3, a coding region of SEQ ID NO: 
3, or the complement of any of the foregoing. 

102. The isolated nucleic acid of claim 101 that is a DNA.

103. An isolated nucleic acid comprising a nucleotide sequence encoding the
protein of any one of claims 97-100, or the complement thereof.




104. A recombinant cell containing the nucleic acid of claim 101, in which the
nucleotide sequence is under the control of a promoter heterologous to the nucleotide
sequence.






105. A recombinant cell containing a nucleic acid vector that comprises the 
nucleic acid of claim 101. 



106. An antibody that binds to a protein consisting of the amino acid sequence 
of SEQ ID NO: 8.



107. The antibody of claim 106 that is monoclonal.

108. A molecule comprising a fragment of the antibody of claim 106, which
fragment binds a protein consisting of the amino acid sequence of SEQ ID NO: 8.



109. A method of producing protein comprising: 

growing a recombinant cell containing the nucleic acid of any one of claims 101- 
103 in which said nucleic acid sequence is under the control of a promoter heterologous 
to said nucleotide sequence, such that the protein encoded by said nucleic acid is
expressed by the cell; and

recovering the expressed protein.

110. An isolated protein that is the product of the process of claim 109. 

111. A pharmaceutical composition comprising a therapeutically effective
amount of the protein of any one of claims 97-103, and a pharmaceutically acceptable
carrier. 



112. A pharmaceutical composition comprising a therapeutically effective 
amount of the nucleic acid of any one of claims 101-103; and a pharmaceutically
acceptable carrier.




113. A pharmaceutical composition comprising a therapeutically effective
amount of the recombinant cell of claim 104 or claim 105; and a pharmaceutically
acceptable carrier. 



114. A pharmaceutical composition comprising a therapeutically effective
amount of an antibody that binds to a protein comprising the amino acid sequence of any
one of claims 97-100, and a pharmaceutically acceptable carrier.

115. A method of treating or preventing a body weight disorder comprising 
administering to a subject in which treatment is desired a therapeutically effective amount
of a molecule that antagonizes in the subject a protein comprising SEQ ID NO: 8, SEQ
ID NO: 11, SEQ ID NO: 15, SEQ ID NO: 18, SEQ ID NO: 25, SEQ ID NO: 26, or SEQ
ID NO: 27. 

116. The method of claim 115 wherein said subject is human.

117. The method of claim 115 in which the molecule that inhibits a function of
one or more of the group consisting of SEQ ID NO: 8, SEQ ID NO: 11, SEQ ID NO: 15,
SEQ ID NO: 18, SEQ ID NO: 25, SEQ ID NO: 26, and SEQ ID NO: 27 is selected from
the group consisting of an antibody that binds to one of SEQ ID NO: 8, SEQ ID NO: 11,
SEQ ID NO: 15, SEQ ID NO: 18, SEQ ID NO: 25, SEQ ID NO: 26, and SEQ ID NO: 27
or a fragment or derivative thereof containing the binding region thereof, a nucleic acid
complementary to the RNA produced by transcription of a gene encoding one of SEQ ID
NO: 8, SEQ ID NO: 11, SEQ ID NO: 15, SEQ ID NO: 18, SEQ ID NO: 25, SEQ ID NO:
26, and SEQ ID NO: 27.



118. The method of claim 115 in which the molecule that inhibits a function of
one or more of the group consisting of SEQ ID NO: 8, SEQ ID NO: 11, SEQ ID NO: 15,
SEQ ID NO: 18, SEQ ID NO: 25, SEQ ID NO: 26, and SEQ ID NO: 27 is an
oligonucleotide that

(a) consists of at least six nucleotides;
(b) comprises a sequence complementary to at least a portion of an RNA
transcript of a gene encoding one of SEQ ID NO: 8, SEQ ID NO: 11, SEQ ID NO: 15,
SEQ ID NO: 18, SEQ ID NO: 25, SEQ ID NO: 26 or SEQ ID NO: 27; and


(c) is hybridizable to said RNA transcript under moderately stringent conditions.
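
The length and complementarity conditions on the oligonucleotide can be checked with a short sketch. It makes two simplifying assumptions: the whole oligonucleotide must be an exact reverse complement of some stretch of the transcript (stricter than the "at least a portion" wording), and hybridization stringency is not modeled at all. All names and the example sequences are hypothetical.

```python
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def to_rna(sequence):
    """Upper-case a sequence and write it in the RNA alphabet (T becomes U)."""
    return sequence.upper().replace("T", "U")


def reverse_complement(rna):
    """Reverse complement of an RNA string; raises KeyError on unknown bases."""
    return "".join(RNA_COMPLEMENT[base] for base in reversed(rna))


def is_antisense_candidate(oligo, transcript, min_length=6):
    """True if the oligo has at least min_length bases and exactly pairs with
    some stretch of the transcript (assumption: exact, ungapped pairing)."""
    oligo_rna = to_rna(oligo)
    if len(oligo_rna) < min_length:
        return False
    return reverse_complement(oligo_rna) in to_rna(transcript)
```

With the placeholder transcript `"AUGGCUAGCUUAGGC"`, the DNA oligonucleotide `"CTAGCC"` passes because its reverse complement `GGCUAG` occurs in the transcript, while the five-base `"CTAGC"` fails the length condition.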

119. A method of treating or preventing a body weight disorder comprising 
administering to a subject in which treatment is desired a therapeutically effective amount 

of a molecule that enhances a function of one or more of the group consisting of SEQ ID
NO: 8, SEQ ID NO: 11, SEQ ID NO: 15, SEQ ID NO: 18, SEQ ID NO: 25, SEQ ID NO: 
26, and SEQ ID NO: 27. 

120. The method of claim 119 wherein said subject is human.

121. A method of diagnosing a disease or disorder or the predisposition to said 
disease or disorder, wherein the disease or disorder is characterized by an aberrant level 
of one of SEQ ID NO: 1 through SEQ ID NO: 29 in a subject, the method comprising
measuring the level of any one of SEQ ID NO: 1 through SEQ ID NO: 29 in a sample
derived from the subject, in which an increase or decrease in the level of one of SEQ ID
NO: 1 through SEQ ID NO: 29 in said sample, relative to the level of one of said SEQ ID
NO: 1 through SEQ ID NO: 29 found in an analogous sample not having the disease or
disorder, indicates the presence of the disease or disorder in the subject.

122. The method of claim 121 wherein the disease or disorder is a body weight
disorder. 

123. The method of claim 122 wherein the body weight disorder is obesity,
anorexia nervosa, bulimia nervosa, or cachexia. 

124. A method of diagnosing or screening for the presence of or predisposition
for developing a disease or disorder involving a body weight disorder in a subject 
comprising detecting one or more mutations in at least one of SEQ ID NO: 1 through 
SEQ ID NO: 29 in a sample derived from the subject, in which the presence of said one 

or more mutations indicates the presence of the disease or disorder or a predisposition for
developing said disease or disorder. 

125. A recombinant non-human animal that is the product of a process 
comprising introducing a nucleic acid encoding at least a domain of one of SEQ ID NO:


8, SEQ ID NO: 15, SEQ ID NO: 18, SEQ ID NO: 25, SEQ ID NO: 26, and SEQ ID NO: 
27 into the non-human animal. 

126. A method for confirming the association of a query QTL or a query gene
in the genome of a second species with a clinical trait T exhibited by said second species,
the method comprising:

(a) mapping (i) a region of the genome of a first species that comprises a first QTL
or a first gene in said first species that is linked to a trait T' to (ii) a region of the genome
of said second species, wherein trait T' is indicative of trait T; and

(b) finding a query QTL or a query gene in said second species that is potentially
associated with said trait T, wherein the potential association of said query QTL or said
query gene with said clinical trait T is confirmed when said query QTL or said query
gene is in said region of the genome of said second species.
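
Steps (a) and (b) of claim 126 can be sketched as a lookup in a table of conserved blocks followed by an overlap check. The block table, the coordinate system, the chromosome labels, and the choice of "overlap" rather than strict containment as the test for "is in said region" are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    chrom: str
    start: int
    end: int


@dataclass(frozen=True)
class SyntenyBlock:
    first: Region   # block coordinates in the first species
    second: Region  # corresponding coordinates in the second species


def overlaps(a, b):
    """True if two regions share a chromosome and at least one position."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def map_region(region, blocks):
    """Step (a): second-species regions of every block overlapping `region`.

    Coarse on purpose: the whole block is returned, not an interpolated sub-interval.
    """
    return [block.second for block in blocks if overlaps(region, block.first)]


def confirm_query(first_region, query_region, blocks):
    """Step (b): the query is confirmed if it falls in a mapped region."""
    return any(overlaps(query_region, mapped) for mapped in map_region(first_region, blocks))
```

A real syntenic map would usually be finer grained; returning the whole block keeps the sketch short at the cost of a wider second-species region.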


127. The method of claim 126, the method further comprising, prior to said
mapping step (a), a step of finding said first QTL or said first gene in said first species
comprising:

(i) crossing a first strain and a second strain of said first species in order to obtain
a segregating population;

(ii) stratifying said segregating population into a plurality of subpopulations,
wherein a subpopulation in said plurality of subpopulations represents a phenotypic
extreme of said trait T';

(iii) using cellular constituent measurements from organisms in the plurality of
subpopulations to identify a cellular constituent set that exhibits a cellular constituent
measurement pattern associated with said phenotypic extreme;

(iv) clustering said segregating population based on measurements of said cellular
constituent set in organisms in said segregating population to obtain a plurality of
population clusters; and

(v) for at least one population cluster in said plurality of population clusters,
performing quantitative genetic analysis on said population cluster in order to find said
first QTL or said first gene in said first species that is linked to said trait T'.
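
Steps (ii) through (iv) of claim 127 can be sketched as follows. The choices here are assumptions, not the method of the specification: extremes are taken as the lowest and highest quarter by trait value, the constituent set is picked by largest difference in mean measurement between the extremes, and clustering is approximated by assignment to the nearer extreme-group centroid. Step (v) is not shown. All names are hypothetical.

```python
def stratify_extremes(trait, fraction=0.25):
    """Step (ii): organisms in the lowest and highest `fraction` by trait value."""
    ranked = sorted(trait, key=trait.get)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]


def mean_profile(organisms, measurements, constituents):
    """Mean measurement of each listed constituent over the given organisms."""
    return [
        sum(measurements[o][c] for o in organisms) / len(organisms)
        for c in constituents
    ]


def select_constituents(low, high, measurements, top_n=2):
    """Step (iii): constituents whose mean differs most between the extremes."""
    names = list(next(iter(measurements.values())))
    diff = {
        c: abs(mean_profile(high, measurements, [c])[0]
               - mean_profile(low, measurements, [c])[0])
        for c in names
    }
    return sorted(names, key=diff.get, reverse=True)[:top_n]


def cluster_population(measurements, constituents, centroids):
    """Step (iv), simplified: assign every organism to the nearest centroid."""
    clusters = {label: [] for label in centroids}
    for organism, values in measurements.items():
        profile = [values[c] for c in constituents]
        label = min(
            centroids,
            key=lambda k: sum((p - q) ** 2 for p, q in zip(profile, centroids[k])),
        )
        clusters[label].append(organism)
    return clusters
```

Here `trait` maps organism to trait value and `measurements` maps organism to a dictionary of constituent measurements; both layouts are illustrative.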

128. The method of claim 127 wherein said cellular constituent measurements 
are transcriptional state measurements or translational state measurements. 






129. The method of claim 127 wherein said cellular constituent measurements 
are translational state measurements that are performed using an antibody array or
two-dimensional gel electrophoresis.


130. The method of claim 127 wherein said cellular constituent set comprises a 
plurality of metabolites and said plurality of cellular constituent measurements are
derived by a cellular phenotypic technique. 

131. The method of claim 130 wherein said cellular phenotypic technique

comprises a metabolomic technique wherein a plurality of levels of metabolites in one or 
more organisms in said segregating population is measured. 

132. The method of claim 131 wherein said metabolites comprise an amino 
acid, a metal, a soluble sugar, or a complex carbohydrate.

133. The method of claim 127 wherein said cellular constituent measurements 
comprise gene expression levels, abundance of mRNA, protein expression levels, or 
metabolite levels.

134. The method of claim 126, the method further comprising, prior to said
mapping step (a), a step of finding said first QTL or said first gene in said first species
comprising:

(i) crossing a first strain and a second strain of said first species in order to obtain
a segregating population;

(ii) dividing said population into a plurality of subpopulations using a
classification scheme that classifies each organism in said segregating population into at
least one of said subpopulations, wherein said classification scheme uses cellular
constituent measurements of a plurality of cellular constituents from each said organism;
and

(iii) for at least one subpopulation in said plurality of subpopulations, performing 
quantitative genetic analysis on said subpopulation in order to find said first QTL or said 
first gene in said first species that is linked to trait T'. 




135. The method of claim 134 wherein said cellular constituent measurements 
are transcriptional state measurements or translational state measurements. 


136. The method of claim 134 wherein said cellular constituent measurements 
are translational state measurements that are performed using an antibody array or
two-dimensional gel electrophoresis.

137. The method of claim 134 wherein said plurality of cellular constituents
comprises a plurality of metabolites and said plurality of cellular constituent measurements
are derived by a cellular phenotypic technique.

138. The method of claim 137 wherein said cellular phenotypic technique
comprises a metabolomic technique wherein a plurality of levels of metabolites in each
said organism is measured.

139. The method of claim 138 wherein said metabolites comprise an amino 
acid, a metal, a soluble sugar, or a complex carbohydrate.

140. The method of claim 134 wherein said cellular constituent measurements 
of said plurality of cellular constituents comprise gene expression levels, abundance of
mRNA, protein expression levels, or metabolite levels.

141. The method of claim 126, the method further comprising, prior to said
mapping step (a), a step of finding said first QTL or said first gene in said first species
comprising:

(i) generating a set of congenic organisms that span all or a portion of the genome 
of said first species using a background strain and a donor strain; and 

(ii) identifying those strains in said set of congenic organisms that exhibit trait T'.

142. The method of claim 126 wherein said mapping step (a) is based upon a

syntenic map between said first species and said second species. 

143. The method of claim 126 wherein said finding step (b) comprises 
performing quantitative genetic analysis on a population of said second species.






144. The method of claim 126 wherein said clinical trait T is asthma, ataxia
telangiectasia, bipolar disorder, cancer, common late-onset Alzheimer's disease, diabetes,
heart disease, hereditary early-onset Alzheimer's disease, hereditary nonpolyposis colon
cancer, hypertension, infection, maturity-onset diabetes of the young, mellitus, migraine,
nonalcoholic fatty liver, nonalcoholic steatohepatitis, non-insulin-dependent diabetes
mellitus, obesity, polycystic kidney disease, psoriases, schizophrenia, or xeroderma
pigmentosum. 



145. The method of claim 126 wherein said quantitative genetic analysis is

performed using a method that uses one or more techniques selected from the group 
consisting of linkage analysis, a quantitative trait locus (QTL) analysis that uses a 
plurality of cellular constituent measurements as a phenotypic trait, and association 
analysis. 



146. The method of claim 145 wherein said first QTL is represented by a lod
score that is greater than 3.0. 
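
For orientation, the lod score thresholds recited in the claims can be read against the usual definition of a lod score as a base-10 log likelihood ratio. The symbols $L_1$ (likelihood of the data with a QTL at the tested position) and $L_0$ (likelihood with no QTL) are notation assumed here, not taken from the claims.

```latex
\mathrm{lod} = \log_{10}\!\left(\frac{L_1}{L_0}\right),
\qquad
\mathrm{lod} > 3.0 \iff \frac{L_1}{L_0} > 10^{3},
\qquad
\mathrm{lod} > 4.0 \iff \frac{L_1}{L_0} > 10^{4}.
```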



147. The method of claim 145 wherein said first QTL is represented by a lod 
score that is greater than 4.0.

148. The method of claim 127 wherein said quantitative genetic analysis is 
performed using a method that uses one or more techniques selected from the group 

consisting of linkage analysis, a quantitative trait locus (QTL) analysis that uses a 
plurality of cellular constituent measurements as a phenotypic trait, and association
analysis. 



149. The method of claim 148 wherein said first QTL is represented by a lod 
score that is greater than 3.0. 



150. The method of claim 148 wherein said first QTL is represented by a lod 
score that is greater than 4.0. 




151. The method of claim 134 wherein said quantitative genetic analysis is 
performed using a method that uses one or more techniques selected from the group
consisting of linkage analysis, a quantitative trait locus (QTL) analysis that uses a
plurality of cellular constituent measurements as a phenotypic trait, and association
analysis.

152. The method of claim 151 wherein said first QTL is represented by a lod
score that is greater than 3.0. 

153. The method of claim 151 wherein said first QTL is represented by a lod

score that is greater than 4.0. 

154. The method of claim 126 wherein said second species is human.

155. The method of claim 126 wherein said clinical trait T is obesity and said
trait T' is high density lipoprotein level, low density lipoprotein level, very low density
lipoprotein level, free fatty acid level, fat pad mass, or weight/height ratio.

156. The method of claim 126 wherein said region of the genome of said first 
species is a portion of a chromosome.

157. The method of claim 126 wherein said region of the genome of said first 
species is less than 100 centiMorgans.

158. The method of claim 126 wherein said region of the genome of said first

species is less than 10 centiMorgans. 

159. The method of claim 126 wherein said region of the genome of said first 
species is less than 5 centiMorgans. 

160. The method of claim 1, the method further comprising:

(d) validating said association between said gene G and said clinical trait T by 
testing for genetic linkage between said expression quantitative trait loci (eQTL) and said 
clinical quantitative trait loci (cQTL). 






161. The method of claim 160 wherein said testing for genetic linkage comprises 
marker-difference regression or a multiple-trait extension of composite interval mapping. 

162. The computer program product of claim 58, the computer program
mechanism further comprising instructions for validating said association between said
gene G and said clinical trait T by testing for genetic linkage between said expression
quantitative trait loci (eQTL) and said clinical quantitative trait loci (cQTL).

163. The computer program product of claim 162 wherein said testing for genetic

linkage comprises marker-difference regression or a multiple-trait extension of composite 
interval mapping. 

164. The computer system of claim 59, the memory further
comprising instructions for validating said association between said gene G and said
clinical trait T by testing for genetic linkage between said expression quantitative trait
loci (eQTL) and said clinical quantitative trait loci (cQTL).

165. The computer system of claim 164 wherein said testing for genetic linkage 
comprises marker-difference regression or a multiple-trait extension of composite interval
mapping.

166. The method of claim 1, the method further comprising validating an 
association between said gene G and said clinical trait T. 

167. The method of claim 166 wherein said validating comprises suppressing said
gene G using an RNAi technique and establishing that said suppression of said gene G
affects said eQTL. 

168. A method of identifying a molecular target for a second trait in a second
species, the method comprising:

(a) identifying a first gene in a segregating population that is causal for a first trait 
exhibited by all or a portion of said segregating population, wherein each member of said 




segregating population is a member of a first species and wherein said second trait in said 
second species corresponds to said first trait in said first species; 

(b) mapping said first gene in said first species to a corresponding locus in the 
genome of the second species; and 
(c) determining whether a marker or a haplotype in said corresponding locus in
the genome of the second species associates with said second trait, wherein, when said
marker or said haplotype associates with said second trait in said second species, said
locus is identified as said molecular target.
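
One common way to test whether a marker associates with a trait is a chi-square test on a 2x2 table of allele counts in affected and unaffected individuals. The claim does not name a test, so the statistic, the 3.841 cutoff (the 5% critical value for one degree of freedom), and the function names below are assumptions for illustration.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the table
    [[a, b], [c, d]], e.g. rows = affected/unaffected, columns = allele 1/allele 2."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a row or column is empty; no evidence either way
    return n * (a * d - b * c) ** 2 / denominator


def marker_associates(a, b, c, d, critical_value=3.841):
    """True if the statistic exceeds the chosen critical value (assumed 5% level, 1 df)."""
    return chi_square_2x2(a, b, c, d) > critical_value
```

A haplotype could be treated the same way by counting carriers and non-carriers; multiple-testing correction across many markers is left out of the sketch.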

169. The method of claim 168 wherein said marker or said haplotype is in a
second gene in said corresponding locus and said second gene is identified as said
molecular target.

170. The method of claim 169 wherein said first gene and said second gene are 
orthologous.

171. The method of claim 168 wherein said identifying said first gene in said
segregating population that is causal for said first trait exhibited by all or a portion of said
segregating population comprises:

(a) identifying a test gene in said first species that has at least one abundance
quantitative trait locus (eQTL) coincident with a respective clinical quantitative trait locus
(cQTL) for said first trait; and

(b) testing, for one or more respective eQTL in said at least one eQTL, whether (i)
the genetic variation of said eQTL across said segregating population and (ii) the
variation of the first trait across said segregating population are correlated conditional on
an abundance pattern of the test gene across said segregating population,

wherein, when the genetic variation of (1) said one or more respective eQTL
tested in step (b) and (2) the variation of the first trait across said segregating population
are correlated conditional on an abundance pattern of the test gene across said segregating
population, said test gene is identified as said first gene.
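
The conditional test in step (b) can be illustrated with a first-order partial correlation between eQTL genotype and trait, controlling for the abundance of the test gene. The claim does not fix a statistic, so partial correlation is an assumed stand-in, and the function names are hypothetical. How the resulting value is interpreted follows the claim and the specification, not this sketch.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(genotype, trait, abundance):
    """Correlation of genotype and trait after removing the linear effect of abundance."""
    r_gt = pearson(genotype, trait)
    r_ga = pearson(genotype, abundance)
    r_ta = pearson(trait, abundance)
    return (r_gt - r_ga * r_ta) / math.sqrt((1 - r_ga ** 2) * (1 - r_ta ** 2))
```

Comparing `pearson(genotype, trait)` with `partial_correlation(genotype, trait, abundance)` shows how much of the genotype-trait relationship remains once the abundance pattern is accounted for.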

172. The method of claim 168 wherein said second species is mammalian. 

173. The method of claim 168 wherein said second species is human. 






174. The method of claim 168 wherein said second trait is asthma, ataxia
telangiectasia, bipolar disorder, cancer, common late-onset Alzheimer's disease, diabetes,
heart disease, hereditary early-onset Alzheimer's disease, hereditary nonpolyposis colon
cancer, hypertension, infection, maturity-onset diabetes of the young, mellitus, migraine,
nonalcoholic fatty liver, nonalcoholic steatohepatitis, non-insulin-dependent diabetes
mellitus, obesity, polycystic kidney disease, psoriases, schizophrenia, or xeroderma 
pigmentosum. 

175. The method of claim 168 wherein said molecular target is a gene.

176. The method of claim 168 wherein said molecular target is an exon, an intron, 
or a regulatory element of a gene. 

177. The method of claim 168 wherein said marker is a single nucleotide
polymorphism, a microsatellite marker, a restriction fragment length polymorphism, a
short tandem repeat, a DNA methylation marker, a sequence length polymorphism, a
random amplified polymorphic DNA, an amplified fragment length polymorphism, or a
simple sequence repeat.

178. A method of identifying a molecular target for a second trait in a second 
species, the method comprising: 

(a) identifying a first gene in a segregating population that is causal for a first trait 
exhibited by all or a portion of said segregating population, wherein each member of said 

segregating population is a member of a first species and wherein said second trait in said
second species corresponds to said first trait in said first species; 

(b) identifying a locus in the genome of the second species that is (1) linked to 
said second trait and (2) maps to the position in the genome of said first species where 
said first gene resides; and 

(c) determining whether a marker or a haplotype in said corresponding locus in

the genome of the second species associates with said second trait, wherein, when said 
marker or said haplotype associates with said second trait in said second species, said 
locus is identified as said molecular target. 




179. The method of claim 178 wherein said marker or said haplotype is in a
second gene in said corresponding locus and said second gene is identified as said 
molecular target. 



180. The method of claim 179 wherein said first gene and said second gene are
orthologous.

181. The method of claim 178 wherein said identifying said first gene in said
segregating population that is causal for said first trait exhibited by all or a portion of said
segregating population comprises:

(a) identifying a test gene in said first species that has at least one abundance
quantitative trait locus (eQTL) coincident with a respective clinical quantitative trait locus
(cQTL) for said first trait; and

(b) testing, for one or more respective eQTL in said at least one eQTL, whether (i)
the genetic variation of said eQTL across said segregating population and (ii) the
variation of the first trait across said segregating population are correlated conditional on
an abundance pattern of the test gene across said segregating population,

wherein, when the genetic variation of (1) said one or more respective eQTL
tested in step (b) and (2) the variation of the first trait across said segregating population
are correlated conditional on an abundance pattern of the test gene across said segregating
population, said test gene is identified as said first gene.

182. The method of claim 178 wherein said second species is mammalian. 

183. The method of claim 178 wherein said second species is human.

184. The method of claim 178 wherein said second trait is asthma, ataxia 
telangiectasia, bipolar disorder, cancer, common late-onset Alzheimer's disease, diabetes, 
heart disease, hereditary early-onset Alzheimer's disease, hereditary nonpolyposis colon 
cancer, hypertension, infection, maturity-onset diabetes of the young, mellitus, migraine,
nonalcoholic fatty liver, nonalcoholic steatohepatitis, non-insulin-dependent diabetes
mellitus, obesity, polycystic kidney disease, psoriases, schizophrenia, or xeroderma
pigmentosum.




185. The method of claim 178 wherein said molecular target is a gene.



186. The method of claim 178 wherein said molecular target is an exon, an intron, 
or a regulatory element of a gene.

187. The method of claim 178 wherein said marker is a single nucleotide 
polymorphism, a microsatellite marker, a restriction fragment length polymorphism, a
short tandem repeat, a DNA methylation marker, a sequence length polymorphism, a
random amplified polymorphic DNA, an amplified fragment length polymorphism, or a
simple sequence repeat.

188. A method of identifying a molecular target for a second trait in a second 
species, the method comprising: 

(a) identifying a first gene in a segregating population that is causal for a first trait 
exhibited by all or a portion of said segregating population, wherein each member of said
segregating population is a member of a first species and wherein said second trait in said
second species corresponds to said first trait in said first species; and

(b) identifying a second gene in the genome of the second species that is
orthologous to said first gene and in which (i) the variation of the abundance of the
second gene across biological samples taken from a plurality of members of said second
species and (ii) the variation of the second trait across said plurality of members of said 
second species are associated, wherein 

said second gene is identified as said molecular target. 

189. The method of claim 188, the method further comprising:

validating said second gene by determining whether a marker or a haplotype in
said second gene associates with said second trait, wherein, when said marker or said
haplotype associates with said second trait in said second species, said second gene is
validated.

190. The method of claim 188 wherein said identifying said first gene in a
segregating population that is causal for a first trait exhibited by all or a portion of said
segregating population comprises:




(a) identifying a test gene in said first species that has at least one abundance
quantitative trait locus (eQTL) coincident with a respective clinical quantitative trait locus
(cQTL) for said first trait; and

(b) testing, for one or more respective eQTL in said at least one eQTL, whether (i)
the genetic variation of said eQTL across said segregating population and (ii) the
variation of the first trait across said segregating population are correlated conditional on
an abundance pattern of the test gene across said segregating population,

wherein, when the genetic variation of (1) said one or more respective eQTL
tested in step (b) and (2) the variation of the first trait across said segregating population
are correlated conditional on an abundance pattern of the test gene across said segregating
population, said test gene is identified as said first gene.
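
Step (b) of claim 190 asks whether eQTL genotype and trait are correlated once the abundance of the test gene is taken into account. The following is a minimal sketch of one way such a conditional correlation could be computed. It assumes a first-order partial correlation as the statistic (the claim does not name one) and genotypes coded 0, 1, 2; the function names and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(genotype, trait, abundance):
    """Correlation of genotype and trait conditional on abundance
    (first-order partial correlation; an assumed choice of statistic)."""
    r_gt = pearson(genotype, trait)
    r_ga = pearson(genotype, abundance)
    r_ta = pearson(trait, abundance)
    return (r_gt - r_ga * r_ta) / math.sqrt((1 - r_ga ** 2) * (1 - r_ta ** 2))


# Hypothetical data: nine organisms, eQTL genotype coded 0/1/2.
genotype = [0, 0, 0, 1, 1, 1, 2, 2, 2]
within = [-0.2, 0.0, 0.2] * 3   # abundance variation not explained by genotype
noise = [0.1, -0.2, 0.1] * 3    # trait variation not explained by abundance
abundance = [g + d for g, d in zip(genotype, within)]
trait = [2 * a + n for a, n in zip(abundance, noise)]

marginal = pearson(genotype, trait)
conditional = partial_corr(genotype, trait, abundance)
print(f"marginal r = {marginal:.3f}, conditional r = {conditional:.3f}")
```

In this toy data set the trait depends on genotype only through abundance, so the marginal correlation is high while the conditional correlation is close to zero. How the conditional correlation is turned into a decision is left to the method of the claim.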

191. The method of claim 188 wherein said second species is mammalian.

192. The method of claim 188 wherein said second species is human.

193. The method of claim 188 wherein said second trait is asthma, ataxia
telangiectasia, bipolar disorder, cancer, common late-onset Alzheimer's disease, diabetes,
heart disease, hereditary early-onset Alzheimer's disease, hereditary nonpolyposis colon
cancer, hypertension, infection, maturity-onset diabetes of the young, mellitus, migraine,
nonalcoholic fatty liver, nonalcoholic steatohepatitis, non-insulin-dependent diabetes
mellitus, obesity, polycystic kidney disease, psoriasis, schizophrenia, or xeroderma
pigmentosum.

194. The method of claim 188 wherein said marker is a single nucleotide
polymorphism, a microsatellite marker, a restriction fragment length polymorphism, a
short tandem repeat, a DNA methylation marker, a sequence length polymorphism, a
random amplified polymorphic DNA, an amplified fragment length polymorphism, or a
simple sequence repeat.



195. A computer system for identifying a molecular target for a second trait in a
second species, the computer system comprising: 
a central processing unit; 

a memory, coupled to the central processing unit, the memory storing: 




instructions for identifying a first gene in a segregating population that is causal 
for a first trait exhibited by all or a portion of said segregating population, wherein each 
member of said segregating population is a member of a first species and wherein said 
second trait in said second species corresponds to said first trait in said first species; 
instructions for mapping said first gene in said first species to a corresponding
locus in the genome of the second species; and

instructions for determining whether a marker or a haplotype in said
corresponding locus in the genome of the second species associates with said second trait.

196. A computer program product for use in conjunction with a computer
system, the computer program product comprising a computer readable storage medium
and a computer program mechanism embedded therein, the computer program
mechanism comprising:

instructions for identifying a first gene in a segregating population that is causal
for a first trait exhibited by all or a portion of said segregating population, wherein each
member of said segregating population is a member of a first species and wherein said
second trait in said second species corresponds to said first trait in said first species;

instructions for mapping said first gene in said first species to a corresponding 
locus in the genome of the second species; and 

instructions for determining whether a marker or a haplotype in said
corresponding locus in the genome of the second species associates with said second trait.

197. A computer system for identifying a molecular target for a second trait in a
second species, the computer system comprising:
a central processing unit;

a memory, coupled to the central processing unit, the memory storing:

instructions for identifying a first gene in a segregating population that is causal
for a first trait exhibited by all or a portion of said segregating population, wherein each
member of said segregating population is a member of a first species and wherein said
second trait in said second species corresponds to said first trait in said first species;

instructions for identifying a locus in the genome of the second species that is (1) 
linked to said second trait and (2) maps to the position in the genome of said first species 
where said first gene resides; and 




instructions for determining whether a marker or a haplotype in said
corresponding locus in the genome of the second species associates with said second trait.

198. A computer program product for use in conjunction with a computer
system, the computer program product comprising a computer readable storage medium
and a computer program mechanism embedded therein, the computer program
mechanism comprising:

instructions for identifying a first gene in a segregating population that is causal
for a first trait exhibited by all or a portion of said segregating population, wherein each
member of said segregating population is a member of a first species and wherein said
second trait in said second species corresponds to said first trait in said first species;

instructions for identifying a locus in the genome of the second species that is (1)
linked to said second trait and (2) maps to the position in the genome of said first species
where said first gene resides; and

instructions for determining whether a marker or a haplotype in said
corresponding locus in the genome of the second species associates with said second trait.

199. A computer system for identifying a molecular target for a second trait in a
second species, the computer system comprising:

a central processing unit;

a memory, coupled to the central processing unit, the memory storing:
instructions for identifying a first gene in a segregating population that is causal
for a first trait exhibited by all or a portion of said segregating population, wherein each
member of said segregating population is a member of a first species and wherein said
second trait in said second species corresponds to said first trait in said first species; and

instructions for identifying a second gene in the genome of the second species that
is orthologous to said first gene and in which (i) the variation of the abundance of the
second gene across biological samples taken from a plurality of members of said second
species and (ii) the variation of the second trait across said plurality of members of said
second species are associated.

200. A computer program product for use in conjunction with a computer
system, the computer program product comprising a computer readable storage medium




and a computer program mechanism embedded therein, the computer program 
mechanism comprising: 

instructions for identifying a first gene in a segregating population that is causal
for a first trait exhibited by all or a portion of said segregating population, wherein each
member of said segregating population is a member of a first species and wherein said
second trait in said second species corresponds to said first trait in said first species; and

instructions for identifying a second gene in the genome of the second species that
is orthologous to said first gene and in which (i) the variation of the abundance of the
second gene across biological samples taken from a plurality of members of said second
species and (ii) the variation of the second trait across said plurality of members of said
second species are associated.



FIG. 1

[Block diagram of computer system 10. Recoverable labels and reference numerals; the pairing of numerals with labels is inferred from their order in the extracted text.]

CPU 22
User interface
NIC 28
Memory 24, storing:
40 Operating system
42 File system
44 Gene expression / cellular constituent data
46-1 Organism 1
48-1-1 Gene 1 / cellular constituent 1
50-1-1 Intensity or cellular constituent level 1
52-1-1 Optional background signal
54-1-1 Optional gene probe annotation
48-1-N Gene N / cellular constituent N
46-Q Organism Q
68 Genotype and pedigree data
70 Marker data
72 Normalization module
74 Marker map construction module
76 Expression / genotype warehouse
78 Genetic marker map
80 Genetic analysis module
82 QTL results database
84-1 QTL 1
86-1-1 Position 1
88-1-1 Statistical score
84-M QTL M
92 Clustering module
94 Cluster database
90 Multivariate QTL analysis module
95 Phenotypic data



FIG. 2

[Flowchart. Recoverable steps:]

202: Cellular constituent data 44, marker data 70, genotype and pedigree data 68.
204: Transform gene expression data 44 into expression statistics.
206: Generate a genetic marker map from genetic markers 70.
208: Generate expression / genotype warehouse 76.
210: [step text illegible in the source]
[numeral not legible]: Pipe QTL results for each gene at each location tested into QTL results database 82.
[numeral not legible]: Create a QTL interaction map using clustering module 92, store results in cluster database 94.
216: Generate a gene expression cluster map from gene expression statistics using clustering.
218: Populate cluster database 94 with clusters of QTL interactions from the QTL interaction map and clusters of gene expression interactions from the gene expression cluster map.
220: Filter QTL interaction map data in order to identify a candidate pathway group.
222: Subject the genes of a candidate pathway group to multivariate analysis in order to determine whether they affect a trait.



FIG. 3A

[Data structure diagram of expression / genotype warehouse 76. Recoverable labels:]

302-1 Gene 1
304-1 Expression statistic set 1
306-1-1 Organism 1; 308-1-1 Expression statistic 1
306-1-2 Organism 2; 308-1-2 Expression statistic 2
306-1-N Organism N; 308-1-N Expression statistic N
302-M Gene M
304-M Expression statistic set M
306-M-1 Organism 1; 308-M-1 Expression statistic 1
306-M-2 Organism 2; 308-M-2 Expression statistic 2
306-M-N Organism N; 308-M-N Expression statistic N

FIG. 3B

[Expression statistic set 304-G for a gene G. Recoverable labels:]

308-G-1 Expression statistic for gene G from organism 1
308-G-2 Expression statistic for gene G from organism 2
308-G-3 Expression statistic for gene G from organism 3
308-G-4 Expression statistic for gene G from organism 4
308-G-N Expression statistic for gene G from organism N

FIG. 3C

[Expression / genotype warehouse 76 with tissue-specific expression statistics. Recoverable labels:]

302-1 Gene 1
304-1 Expression statistic set 1
306-1-1 Organism 1; 308-1-1-a Expression statistic 1 / Tissue a; 308-1-1-b Expression statistic 1 / Tissue b
306-1-N Organism N; 308-1-N-a Expression statistic N / Tissue a; 308-1-N-b Expression statistic N / Tissue b
302-M Gene M
304-M Expression statistic set M
306-M-1 Organism 1; Expression statistic 1 / Tissue a [numeral not legible]; 308-M-1-b Expression statistic 1 / Tissue b
306-M-N Organism N; 308-M-N-a Expression statistic N / Tissue a; 308-M-N-b Expression statistic N / Tissue b
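
The warehouse layout of FIGS. 3A to 3C (gene, then expression statistic set, then one statistic per organism and optionally per tissue) can be illustrated with a small in-memory structure. This is a sketch only: the class name, method names, and identifiers such as `gene_1` are hypothetical, and the source does not prescribe any particular storage format.

```python
from collections import defaultdict


class ExpressionWarehouse:
    """Toy stand-in for expression / genotype warehouse 76 (FIGS. 3A to 3C)."""

    def __init__(self):
        # gene id -> {(organism id, tissue or None): expression statistic}
        self._sets = defaultdict(dict)

    def add(self, gene, organism, statistic, tissue=None):
        """Store one expression statistic (element 308 in the figures)."""
        self._sets[gene][(organism, tissue)] = statistic

    def statistic_set(self, gene, tissue=None):
        """Return {organism: statistic} for one gene, i.e. one set 304.

        With tissue=None only statistics stored without a tissue are returned.
        """
        entries = self._sets.get(gene, {})
        return {org: value for (org, t), value in entries.items() if t == tissue}


warehouse = ExpressionWarehouse()
warehouse.add("gene_1", "organism_1", 0.42)
warehouse.add("gene_1", "organism_2", -0.10)
warehouse.add("gene_1", "organism_1", 1.30, tissue="a")

whole_organism = warehouse.statistic_set("gene_1")
tissue_a = warehouse.statistic_set("gene_1", tissue="a")
```

Keying each statistic by organism and tissue keeps the FIG. 3A and FIG. 3C variants in one structure; a set retrieved this way is the kind of per-gene vector that a QTL analysis would treat as the quantitative trait.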



FIG. 4

[Data structure diagram of QTL results database 82. Recoverable labels:]

84-1 QTL 1
86-1-1 Position 1; 88-1-1 statistical score
86-1-2 Position 2; 88-1-2 statistical score
86-1-X Position X; 88-1-X statistical score
84-M QTL M
86-M-1 Position 1; 88-M-1 statistical score
86-M-2 Position 2; 88-M-2 statistical score
86-M-X Position X; 88-M-X statistical score
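
The QTL results database of FIG. 4 pairs each tested position with a statistical score. A minimal sketch of that record layout follows, assuming positions are given as (chromosome, map position) tuples and that a higher score is more significant; the class name, the threshold value, and the data are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class QTLResult:
    """One QTL analysis (element 84): positions 86 mapped to scores 88."""
    trait: str
    scores: dict = field(default_factory=dict)  # (chromosome, position) -> score

    def record(self, chromosome, position, score):
        """Store the statistical score obtained at one tested position."""
        self.scores[(chromosome, position)] = score

    def peaks(self, threshold):
        """Return the tested positions whose score meets the threshold, sorted."""
        return sorted(locus for locus, s in self.scores.items() if s >= threshold)


result = QTLResult(trait="gene_1_expression")
result.record("chr2", 10.0, 1.2)
result.record("chr2", 20.0, 4.7)
result.record("chr7", 5.0, 3.1)

significant = result.peaks(threshold=3.0)  # threshold chosen for illustration only
```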



FIG. 5

[Schematic. Recoverable labels: parentals, F1 and F2 mice; physiological phenotypes, subjected to QTL analysis to yield genetic loci; mRNA expression levels (genes; gene clusters), subjected to QTL analysis to yield genetic loci.]



FIG. 6

[Plot; no text is recoverable from the source.]

FIG. 7

[No text is recoverable from the source.]

FIG. 8

[Plot; no text is recoverable from the source.]

FIG. 9

[Plot; no text is recoverable from the source.]

FIG. 10

[Plot; no text is recoverable from the source.]

FIG. 11 (PRIOR ART)

[No text is recoverable from the source.]

FIG. 12 (PRIOR ART)

[Diagram. Partially recoverable labels: nucleotide; 9530010C24Rik; EST AW456442; EST AW540195; 5'; N-terminal.]

FIG. 13

[No text is recoverable from the source.]

FIG. 14

[No text is recoverable from the source.]

[Sheet 17/91: figure label and text not recoverable from the source.]

FIG. 16

[No text is recoverable from the source.]

FIG. 17

[No text is recoverable from the source.]

FIG. 18

[No text is recoverable from the source.]



FIG. 19

[Flowchart. Recoverable steps:]

1902: Gene expression data 44, marker data 70, genotype and pedigree data 68, phenotypic data 95.
1904: Transform gene expression data 44 into expression statistics.
1906: Generate a genetic marker map from marker data 70.
1908: Generate expression / genotype warehouse 76.
1910: Identify an expression quantitative trait locus (eQTL) for a gene G using QTL analysis in which a plurality of expression statistics (e.g., expression statistic set 304) (Fig. 3A / Fig. 3B) is the quantitative trait used in the QTL analysis.
1912: Identify a clinical quantitative trait locus (cQTL) that is linked to a clinical trait T using QTL analysis in which a phenotypic statistic set 2102 for the clinical trait T is the quantitative trait used in the QTL analysis.
1914: Do the eQTL and the cQTL colocalize to the same locus in the genome? (Yes / No)
1916: Does the eQTL colocalize to the physical location of gene G in the genome? (Yes / No)
1918: Select another gene G when available.
1920: Gene G is associated with clinical trait T.
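
The two decisions of FIG. 19 (steps 1914 and 1916) can be sketched as a small predicate. The sketch assumes that both decisions must be answered Yes to reach step 1920, that loci are (chromosome, position) tuples, and that "colocalize" means lying on the same chromosome within a fixed window; the routing of the flowchart arrows, the window size, and all names are assumptions, not taken from the source.

```python
def colocalize(locus_a, locus_b, window):
    """True when two (chromosome, position) loci share a chromosome and lie
    within `window` map units of each other (an assumed criterion)."""
    return locus_a[0] == locus_b[0] and abs(locus_a[1] - locus_b[1]) <= window


def gene_associated_with_trait(gene_locus, eqtls, cqtls, window=5.0):
    """Apply the checks of steps 1914 and 1916 to every eQTL of gene G."""
    for eqtl in eqtls:
        overlaps_cqtl = any(colocalize(eqtl, c, window) for c in cqtls)  # step 1914
        overlaps_gene = colocalize(eqtl, gene_locus, window)             # step 1916
        if overlaps_cqtl and overlaps_gene:
            return True   # corresponds to step 1920
    return False          # corresponds to step 1918: move on to another gene


eqtls_for_gene = [("chr2", 20.0), ("chr7", 5.0)]
cqtls_for_trait = [("chr2", 24.0)]

cis_case = gene_associated_with_trait(("chr2", 21.0), eqtls_for_gene, cqtls_for_trait)
trans_case = gene_associated_with_trait(("chr9", 50.0), eqtls_for_gene, cqtls_for_trait)
```

In the first call the gene lies at its own eQTL, which also overlaps the cQTL, so the gene is reported as associated; in the second the gene lies on another chromosome and is not.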



FIG. 20

[No text is recoverable from the source.]



FIG. 21

[Data structure diagram of phenotypic data 95. Recoverable labels:]

2102-1 Phenotypic statistic set for clinical trait 1
2104-1-1 Phenotypic value for organism 1
2104-1-2 Phenotypic value for organism 2
2104-1-3 Phenotypic value for organism 3
2104-1-Q Phenotypic value for organism Q
2102-Z Phenotypic statistic set for clinical trait Z
2104-Z-1 Phenotypic value for organism 1
2104-Z-2 Phenotypic value for organism 2
2104-Z-3 Phenotypic value for organism 3
2104-Z-Q Phenotypic value for organism Q

FIG. 22

[Block diagram. Recoverable labels:]

2202 Expression quantitative trait loci (eQTL) identification module
2204 Clinical quantitative trait loci (cQTL) identification module
2206 Determination module

FIG. 23

[No text is recoverable from the source.]



FIG. 24

[Schematic relating cQTL and eQTL. Recoverable labels:]

Step 1912: phenotypic statistic sets 2102-1 to 2102-4 (clinical traits 1 to 4, each with values for organisms 1 to N) yield cQTL1 to cQTL4.
Step 1910: expression statistic sets 304-Gene1 to 304-Gene4 (genes 1 to 4, each with values for organisms 1 to N) yield eQTL1-1 to eQTL1-4, eQTL2-1 to eQTL2-3, eQTL3-1 and eQTL3-2, and eQTL4-1.
Element 2402 lists the eQTL in four groups: (eQTL1-1); (eQTL1-2, eQTL2-1); (eQTL1-3, eQTL2-2, eQTL3-1); (eQTL1-4, eQTL2-3, eQTL3-2, eQTL4-1). Some entries carry a mark in the source whose meaning is not recoverable.



FIG. 25

[Flowchart. Recoverable steps; decision boxes whose text did not survive extraction are noted.]

2502: Select a gene G from reference species (e.g., a mouse gene) that was identified using quantitative genetics methods (e.g., a gene verified in processing step 222 of Fig. 2 or a gene that has been associated with clinical trait T in processing step 1920 of Fig. 19).
[Decision box; text not recoverable] (Yes)
2506: BLAST search of all known nucleotide sequences in target species using the nucleotide sequence of gene G to obtain best match G'.
2508: BLAST search of protein sequences in target species using the translated amino acid sequence for gene G, denoted P, to obtain best match P'.
2510: Is P' the protein product of G'? (Yes / No)
2512: BLAST search of nucleotide sequences in reference species with G'.
2514: BLAST search of protein sequences in reference species using P'.
2516: [Decision box; text not recoverable] (Yes)
2518: Ortholog to gene G in the target species has not been identified.
2520: Ortholog to gene G in the target species has been identified.
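
The forward and reverse searches of FIG. 25 resemble a reciprocal best hit test. The sketch below illustrates that general idea only. It assumes the final decision checks that the reverse search returns the original gene G, and it replaces BLAST with a toy position-by-position identity score so that the example is self-contained; the function names, the scoring, and the sequences are hypothetical.

```python
def identity(seq_a, seq_b):
    """Fraction of matching positions (toy stand-in for a BLAST score)."""
    matches = sum(1 for x, y in zip(seq_a, seq_b) if x == y)
    return matches / max(len(seq_a), len(seq_b))


def best_match(query, database):
    """Name of the database entry most similar to the query sequence."""
    return max(database, key=lambda name: identity(query, database[name]))


def reciprocal_best_hit(gene, reference, target):
    """Return the target-species best match of `gene` if the reverse search
    from that match leads back to `gene`; otherwise return None."""
    hit = best_match(reference[gene], target)   # forward search (cf. step 2506)
    back = best_match(target[hit], reference)   # reverse search (cf. step 2512)
    return hit if back == gene else None


reference = {"G": "ATGGCGGGCCAG", "H": "TTTTCCCCAAAA"}
target = {"G_prime": "ATGGCGGGTCAG", "X": "TTTTCCCCAAAT"}
ortholog = reciprocal_best_hit("G", reference, target)

# A closer paralog in the reference species breaks reciprocity.
reference_with_paralog = dict(reference, G2="ATGGCGGGTCAG")
no_ortholog = reciprocal_best_hit("G", reference_with_paralog, target)
```

The figure runs the search on both nucleotide and protein sequences and checks that the two best matches agree (step 2510); the sketch shows a single sequence type for brevity.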



FIG. 26

[Flowchart. Recoverable steps:]

2602: Generate a response profile for gene x over a large condition set {A} in reference species.
2604: Optionally, identify maximally informative subset of conditions {a} out of {A} that can be profiled in organism Y.
2606: Obtain response profiles over condition set {a} in target species for a gene y in the target species.
2608: Correlate over conditions {a} the response of gene x in the reference species with gene y in the target species.
2610: Is another gene y available for analysis?
2612: Declare functional relatedness (ortholog) based on high correlation.
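
Steps 2608 to 2612 of FIG. 26 correlate the response of gene x with each candidate gene y over the shared condition set {a} and declare relatedness when the correlation is high. A minimal sketch follows, assuming Pearson correlation as the measure and a cutoff of 0.9 for "high"; neither is stated in the figure, and the names and response values are hypothetical.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation of two equal-length response profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_functional_match(profile_x, candidates, min_r=0.9):
    """Loop over candidate genes y (step 2610) and return the one whose
    profile correlates best with gene x, or None if none reaches min_r."""
    best_name, best_r = None, min_r
    for name, profile_y in candidates.items():
        r = pearson(profile_x, profile_y)   # step 2608
        if r >= best_r:
            best_name, best_r = name, r
    return best_name                         # step 2612 when not None


# Responses over four shared conditions {a}.
response_x = [1.0, 2.0, 0.5, 3.0]
candidates = {
    "gene_y1": [2.1, 3.9, 1.0, 6.2],
    "gene_y2": [3.0, 0.5, 2.0, 1.0],
}
match = best_functional_match(response_x, candidates)
```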






1 GAGCTATTCG GCCTCTCTAQ QCCGGCGGGT CCTCCGCTCC ATGQTCCTGT CTGTCAGCGC 
61 TGTGTCAGGA GGCCAGTGCC GAGGTCCGGT C6CGCTCCGA CGCTTCGACC CTCGAGCCGG 
121 TCGCQGGTAT CCCG6CGGCC GCGGGACGAT GGCGTGGTGG CACTQACAGG CGCGGGCGGC 
181 TGCCGAGCCC CGCGGGCGGC ATGGCGGGCC AGTTCCGCAG CTACGTGTGG GACCCGTTGC 
241 TAATCCTGTC GCAGATCGTA CTCATGCAGA CCGTCTACTA TGGCTCTCTG GGCCTGTGGC 
301 TGGCGCTGGT GGACGCGCTG GTGCGCAAGC CCGTCCCTGG ACCAGATGTT CGACGCGGAG 
361 ATCCTGGGCT TCTCCACCCC TCCAGGCCGG CTCTCAATGA TGTCCTTCGT CCTCAACGCC 
421 CTCACCTGTG CCCTGGGCTT GCTGTACTTC ATCCGGCGAG GGAAGCAGTG CCTGGATTTC 
481 ACTGTCACTG TGCATTTCTT TCACCTCCTG GGCTGCTGGC TCTACAGCTC CCGTTTCCCC 
541 TCGGCGCTGA CCTGGTGGCT GGTCCAGGCT GTGTGCATTG CACTCATGGC CGTCATCGGG 
601 GAGTACCTGT GCATGCGGAC GGAGCTCAAG GAGATCCCCC TCAGCTCAGC CCCTAAGTCC 
661 AATGTCTAGA GTTGGGCCCT TTGGACACTC TGCTGGCACT TGGGCCCCAT CACCTTGGGC - 
721 TGCTCAGACC TCCAGATGGG GTCTGGCCCA AGTCTQAGCA GAACCCTGQA AATGTGAAGT 
781 CTGXTGGTGG AGAGATAATQ AGGTCCCATC ATAAAGGCAG GTAGCAGCCA TGATCACAGA 
841 TGTAAGAATG GCCTCTGTCT GCCATAGCCT TQATATCTGG AGGCCAGTAA GGGACCTCAT 
901 GGAGGGTAGT GGCAGATTTQ GAACCATGTC ACATGAGCCA TCATACTGTC ACCAGCCTGT 
961 TATTTTAAAA AGAAAAAAAA AAAATCAAGQ ATATCTQATT GGAATAAACC ACTCTTCTCG 
1021 TTGTCTGTCT TATGCCCATG ACASCCAGTA CCTTTGCTGT GTTGCCAAAC CACAGGGATT 
1081 CTCTGTGGAG AAATACCTGA TTTCTGGGTC CATAGCCACA GAAAAAGATG TAGGTACAGA 
1141 GTGCTAGGCT GCTGACAGGA CGTCGAGGGG AGGAGGCATC AAGCACAAGA AAAATGCATG 
1201 GCGGTGCCGT TAGACACACA CACACACTTT TGTGTGTGTC CAGGACCCAT GACTGTCTCC 
1261 CTCCAGTTCC CTGTATGGAC TCTGCCTTGC TGTTGTCACT CAGCACAGCC AGAGACAGGA 
1321 CCCAGAGAAA ACCCCAGCAT CCCTCCCAGC CTTCCCTTCA TAATAAAAGC CATTGTCTGC 
1381 TCTCTGGAAG TGAGCAGGCA GCCAGCTTCT ACTGGACCTC AACTGTGGCA GGAGTTTCTG 
1441 TTTGCTGTCT TTTGAGTTCT GTGATAGGGA GGGTGTACTA AAGGTGCTGG AGGCTCACCC 
1501 TGCTAAGCTT TCTTCCAAGT GGTTTCCTCA GGAAGGGCTG GCAGCTGTCC TTCCTAGGTA 
1561 CATAAATACA CTATTTTCCA ATC 



Figure 27



TCTAGGCCGGCa^GCCTCTCbTCCATGGTCCTGTCTGTCAGCGCTGT^ 

CGGGCCACGCTCAGACACTTCXSATCGTCGAGTCTGTC^ 

GTGGGACCCQCTGCTGATCCTGTCGCa^GATCGTCCrCATGCaiG^ 

TGGCTGGCGCTGGTGGACGGGCTAGTGCGACAGCCCCTCGCTGGACCS^GATGTTCGACGCCC^ 

GGCTTTTCCACCCCTCCAGGCCGGCTCrCCATGATGTCCTTCATCCTCAACGCCCTCACCT 

GCTTGCTGTACTTCATCCGGCGAGGAAAGCAGTGTCTGGATTTCACTGTCACTGTCC^ 

CCTGGGCTGCTGGTTCTACAGCTCCCGTTTCCCCTCGGCGCTGACCTGGTGGCTGGTCCAAGCCGTGTGC 

ATTGCACTCATGGCTGTCT^TCGGGGAGTACCTGTGCATGCGGACGGAGCTCTAGGAGATAGCCCTCA^ 

Cy^GCCCCTAAATCCAATGTCTAGAATCAGGCCCTTTGGACATCCTGCTGACT^CTTGGGCCCCTTA^ 

TTGGGCTGCTCAGACCCTCC^VGATGAGGTCCTIGCCCAGATCTGAGAGGAACCCTGGAAATGTGA^ 

TGTTCGTTTGGGAGAGATAGTGAGGGCCTGTCAAAGAAGGCTIGGTAGCAGTCAGC^ 

ATGACCTCTGTCTGTTGAAGCCTTGGTATCTCAGAGGTCAGGAAGGGGACCTCTTTGAGGGTAATAACAG 

AATTGGAACCATGCCACTCTTGAGCa^CAATACCTGTCACCAGCC^ 

CAAGGATATCTGATTGGAGCAAACaiCTTCTTTAGTCATCTGTCTTACCCCCCTGGGAC^ 

TTGCAGTGTTOCCGAATCACAGCAGTTACCTTTGCAGTGTTGCOa^ 

CGCTTGGTTTCCGGATCCa^GAGCCACAGAAAGAAATGTAGGTGTGAAQTATT 

GGATGGCAGATGGAGGCATCAAGCACa^GGAAAATGCACTACCTGT^ 

GCACCCAAGAACCTATGACTTTCTTCCAGTTCCTTCTACCAGGTCCCaiTCCTGCTGCCAGCTCTCAA^ 

TAGCAGGCCATAGGACCCAGAGAAGAATCCCAGCGTTGCTCAAAGTCTAACCM'C^ 

GTCTTCTAGGAATGACCAGGCACCCAGCTCCCACTGGACTCCAATTT^ 

CTTTGGCGGGAAGGGTATGATGGGTTCCCAGAGACAAGAAGCCCAACCTTCTGGCCTGGGCTGTGC 
AGTGCIKSAGGGAGATAGGAATITGCTGCTAAGATTTTTCTTTGGGGTGGAGTTTCCTCTGT^ 
GCAGCTATCCTTCCTGTGTATACAAATACAGTATTTTCCATGGTTCTGCCTGCACTTACTTTGTAA 
ACGGTTGAGATTGAGAGAGATCAGCGCAGCCAGGCAAGGGAAGTTTAAAGAATTATTAGGCCACCCT 
XOTTTCCTGGAGCCCAGAGTCATTCCTCCATTTGGTTAAAATACTCAGTGCAGGGAACT 
TCrCCITCT^CTTGCyVGCGTCCCCTGCTATGCCTCAGGTGAACCACATAATTCT^ 
TTGCTAGTGATTTCTGAACATGTTCAATGGAQCGGCACACAGTCTAGACCCACTTCCGCAT 
CACTGTTCCTCTTTGGTTTCTTCAGAGCTTTCCCAAGAGAGCTGTC^ 
ATGAGTTTATGGTAACa^OVAATGAGTTTTGCT 

CCCTACAQAGTAGGGAGTTGATGCTGACaiGGATGAAGATTTAGGAATi^ 

GAAGGTTCTAGGGTGAGGCACCTCAGTAACTCATGGTACCTTGG 

ATGAGGCy^CAGTAATCCTGGCTGCAGGGTCTAGQAGGTAAGACCAGCTGGGATC 

TCAATTTCCCTCTAGACAACAC?UVACTGCAGGCATGTGACTAACTTTGAAAGAA 

GCTGTCACCCTTGACCAGCCGTGGTGGTGGTTACTCCATCTGTGGTTGGAGCGCCTCTTO 

TCAAGGTCTTGTGCCTATTTTTCTGCATATCTTCTGTGATGACAAATCTCTGTCCCCTGAGTGTTAATT^ 

GATTTTTAGAAATGGCCAAAAGTCACGTGATCCAAACTTTTTl^ 

GTAGTTGGGGATCAAAAATATGTGACCTTAATGAGATTTTTATGAT^ 

TTTTAGAGTTGAGTTCCAGAGAGGGCAGGGCAATGGCAGTGACATGTTTGT^^ 

ATCTATTGAGTGCTTAA 



Figure 28 




ATGGCGGGTC^IGTTCCGCAGCTACGTGTGGGACCCGCTGCTGATCCT 

CCGTGTATTACGGCTCGCTGGGCCTGTGGCTGGCGCTGGTGGACGGGCTAGTGCGACAGCCCCTCGCTGQ 

ACCAQATGTTCGACGCCGAGATCCTGGGCTTTTCCACCCCTCCAGGCCGGCTCTCCATGATGTCCTTCAT 

CCTCAACGCCCTCACCTGTGCCCTGGGCTTGCTGTACTTCATCCGGCGAGGAAAGCAGTGTCTC 

ACTOTCACTGTCCATTTCTTTCACCTCCTGGGCTCCTGGTTCTACAGCTCCCGTI^ 

CCTGGTCGCTGGTCCMGCCGTGTGCATTGCACTCATGGCTGTCATCGGGGAGTACCTGT^ 

G6AGCTCAAGGAGATACCCCTCAACTCAGCCCC 

Figure 29 



MAGQFRSYVW DPLLILSQIV LMQTVYYGSIi GLWLALVDAL VRSSPSLDQM FDABILGPST 
PPGRLSMMSF VIiNALTCALO LLYFIRRGKQ CLDFTVTVHF FHLIiGCWIiYS SRFPSALTWW 
liVQAVCIALM AVIGEYLCMR TELKEIPLSS APKSNV 

Figure 30A 



MALWACGWRW WTRWCAQPVP GPDVRRGDPG LIiHPSRPALN DVLRPQRPHL CPGLAVUSPA 
REAVPGFHCH CAFLSPPGLL ALQIiPFPLGA DLVAGPGCVH CTHGRHRGVP VHADGAQGDP 
PQLSP 

Figure 30B 



MAGQFRSYVW DPLLILSQIV LMQTVYYGSL GLWLALVDAL VRKPVPGPDV RRGDPGLLHP 
SRPALNDVLR PQRPHLCPGL AVLHPAREAV PGFHCHCAFL SPP6LLALQL PFPLGADLVA 
GPGCVHCTHG RHRGVPVHAD GAQGDPPQLS P 

Figure 30C 

* 

MAGQFRSYVW DPLLILSQIV LMQTVYYGSL GLWLALVDAL VRSSPSLDQM FDAEILGFST 
PPGRLSMMSP VLNALTCALG LLYFIRRGKQ CLDFTVTVHF FHLLGCWLYS SRFPSALTWW 

LVQAVCIALM AVIGEYLCMR TELKEVPLSS APKSNV 

Figure 30D



PFPGSRGPQL FGLSRPAGPP LHGPVCQRCV RRPVPRSGRA PTLRPSSRSR VSTUtPKDDGV 
VALTGAGGCR APRAGMAGQF RSYVWDPLLI LSQIVLMQTV YYGSIiGLWWR WWTRWCAQPV 
PGPDVRRGDP GLLHPSRPAL NDVLRPQRPH LCPGIAVIoHP AREAVPGFHC HCAFIiSPPGL 
LAIjQLPFPLG ADLVAGPGCV HCTHGRHRGV PVHADGAQGD PPQhSF 



Figure 30E



MAGQFRSYVW DPLLILSQIV liMQTVYYGSL GLWLALVDAL VRKPVPGPDV RRGDPGLLHP 
SRPALNDVLR PQRPHLCPGL AVIiHPARBAV PGPHCHCAFL SPPGLLALQIi PFPLGADIiVA 
6P6CVHCTH6 RHRGVPVHAP GAQGDPPQIiS P 



Figure 30F 



MAGQFRSYVW DPLIjILSQIV LMQTVYYGSL GLWLALVDGL VRQPLAGDPV RRRDPGLFHP 
SRPAIiHDVIiH PQRPHLCPGIi AVLHPARKAV SGPHCHCPFIi SPPGLLVLQL PFPLGADIiVA 
GPSRVHCTHG CHRGVPVHAD GAQGDTPQLS P 

Figure 31 






1 GCACGAGGGC GGGCGCX3CGC GTGGGCGCAG CGCGGAGCGG GGCCCATGGT GCGGCCGTGT 
61 CCGTCGGTCG GGCCGCGCGG GCGGCTCCGC .GCGTGGCCCG GCGCTCGC6A CCTCGCCCCT 
121 GCGCTGCGGG CCCGGCCCGC CCGCTGCCGG CGCCTCCTCC CCCTGCCCCG GGGCGGCGCG 
181 GAGGCCGCGG GGAGCGCAGG GGGCGCGGCG GGCGGCGACA TGACGGACAG CATCCCGCTG 
241 CAGCCCGTGC GCCACAAGAA GCGGGTGGAC AGTAGGCCGC GCGCGGGGTG CTGTGAGTGG 
301 CTGAGATGTT GCGGTGGAGG GGAGCGCAGG CCCCGTACTG TCTGGTTGGG ACACCCCQAG 
361 AAGAGGGACC AGCGGTACCC TCGAAATGTC ATCAACAACC AGAAGTACAA TTTCTTCACA 
421 TTTCTTCCTG GGGTGTTGTT CAGCCAGTTC AGATACTTCT TCAACTTCTA CTTCCTGCTT 
481 CTCGCCTGCT CGCAGTTCGT CCCAGAGAT6 AG6CTTGGCG CCCTGTACAC CTACTGGGTT 
541 CCTCTGGGCT TCGTGCTGGC TGTCACCATC ATCCGTGAGG CAGTAGAGGA GATCCGATGT 
601 TATQTGCGTG ACAAGGAGAT GAACTCCCAG GTCTACAGCC GGCTCACGTC ACGAGGGACC 
661 6TGAAGGTGA AGAGTTCAAA CATCCAGGTG GGAGACCTCA TCCTTGTGGA AAAGAACCAG 
721 CGGGTCCCTG CTGACATGAT CTTCCTGAGG ACGTCAGAGA AAAACGGCTC TTGCTTCTTG 
781 CGCACGGATC AGCTGGATGG AGAGACAGAC TGGAAGCTTC GGCTCCCGGT GGCCTGCACA 
841 CAGAGGCTTC CCACGGCTGC TGACCTCCTG CAGATTCGGT CCTATGTGTA CGCTGAAAAA 
901 CCCAACATCG ACATTCACAA CTTCCTGGGG ACTTTCACCA GGGAAAACAG TGACCCTCCG 
961 ATCAGTGAGA GTCTGAGCAT TGAGAACACG CTGTGGGCCG GCACCGTCAT AGCATCAGGC 
1021 ACTGTTGTAG GCGTTGTTCT CTACACTGGC AGAAAACTGC GGAGTGTCAT GAATACTTCC 
1081 GACCCCAGAA GTAAGATTGG CCTGTTCGAC CTGGAGGTGA ACTGCCTCAC CAAAATCCTG 
1141 TTTGGTGCGC TGGTGGTGGT GTCCCTGGTC ATGGTGGCCC TGGAGCACXT TGCCGGCCQC 
1201 TGGTACCTGC AGATCATCCQ CTTCCTGCTC CTGTTTTCCA ACATCATTCC TATCAGCTTG 
1261 CGTGTGAACT TGGACATGGG CAAGATCGTG TACAGCTGGG TGATCCGCAQ GGATTCCAAA 
1321 ATCCCCGGGA CCGTGGTTCQ TTCCAGCACA ATTCCT6AGC AQCTQGGCAG GATTTCGTAC 
13 81 TTGCTCACAG ACAAGACAGG AACCCTGACC CAGAATQAGA TGGTGTTCAA GCGGCTGCAC 
1441 CTGGGTACGG TGGCCTACGG CCTGGACTCC ATGGACGAAG TGCAGAGTCA CATCTTCAGC 
1501 ATTTACACCC AGCAATCCCA GGATCCACCT GCTCAQAAGG GCCCCACGGT CACCACCAAG 
1561 GTCCGGAQGA CCATGAGCAG CCGTGTCCAC GAGGCTGTGA AGGCCATTGC ACTCTGCCAC 
1621 AACGTGACAC CCGTGTACGA GTCCAATGGT GTGACGGACC AGGCTGAGGC TGAGAAGCAG 
1681 TTTGAGGACT CCTGCCGAGT GTACCAGGCA TCCAGCCCGG ATGAGGTGGC TCTGGTCCAG 
1741 TGGACAGAAA GTGTGGGACT QACGCTGGTG GGTCGAGACC AGTCCTCCAT GCAGCTGAGG 
1801 ACCCCTGGTG ACCAGGTCCT GAATCTCACC ATCCTTCAGG TCTTCCCGTT CACCTATGAG 
1861 AGCAAGCGGA TGGGCATCAT CGTGCGGGAT GAGTCCACGG GGGAAATCAC GTTCTACATG 
1921 AAGGGAGCAG ACGTCGTCAT GGCTGGCATT GTCCAGTACA ACGACTGGCT GGAGGAGGAG 
1981 TGTGGCAACA TGGCCCGGGA GGGACTACGT GTGCTGGTGG TAGCCAAGAA GTCCCTCACA 
2041 GAGGAGCAGT ACCAACACTT TGAAGCCCGC TACGTCCAGG CTAAGCTGAG TGTGCATGAC 
2101 CGCTCGCTGA AGGTGGCCAC GGTGATCQAG AGCTTGGAGA TGGAGATGQA GCTGCTGTGC 
2161 CTGACTGGTG TGGAGGACCA GCTGCAGGCA GATGTCAGGC CCACGCTGGA GACGCTGCGC 
2221 AACGCTGGCA TCAAGGTTTG GATGCTAACA GGGGACAAGC TGGAGACAGC CACGTGCACA 
2281 GCCAAGAAGG CACATCTGGT GACCAGAAAC CAAGATATCC ATGTTTTCCG ACTGGTGACC 
2341 AACCGCGGGG AGGCCCACCT GGAGCTGAAT GCCTTCCGTA GGAAGCATGA CTGTGCCCTG 
2401 GTCATCTCTG GAGACTCCCT GGAGGTTTGC CTCAAATACT ATGAGTACGA GTTCATGGAA 
2461 CTGGCCTGCC AGT6CCCGGC TGTGGTGTGC TGCCGCTGTG CCCCAACCCA GAAGGCCCAG 
2521 ATTGTTCGGC TGCTCCAAGA ACGCACCGGG AAACTCACCT GTGCAGTATG GGACGGAGGC 
2581 AATGACGTCA GCATGATCCA GGAATCCGAC TGCGGCGTGG GCGTGGAGGG CAAGGAAGGG 
2641 AAGCAGGCCT CGCTGGCAGC GGACTTCTCC ATCACCCAGT TCAAGCATCT CGGCCGCTTG 
2701 CTCATGGTGC ACGGTCGGAA CAGCTACAAG CGCTCGGCGG CCCTCAGTCA GTTTGTGATC 
2761 CACAGGAGCC TCTGCATCAG CACCATGCAG GCTGTCTTCT CGTCTGTGTT CTACTTTGCA 



Figure 32A 






2821 TCCGTTCCTC TCTACCAAGG CTTCCTGATC ATTGGGTATT CTACCATCTA CACGATGTTT 
2881 CCCGTGTTCT CCCTGGTTTT GGACAAAGAC GTGAAGTCGG AAGTCGCCAT GTTGTATCCT 
2941 GAGCTCTACA AGGACCTGCT TAAGGGGCGG CCACTGTCCT ACAAGACGTT CTTAATTTGG 
3001 GTGTTAATCA GCATCTATCA AGGGAGCACC ATCATGTACG GGGCGCTGCT GCTGTTCGAG 
3061 TCGGAGTTTG TACACATCGT GGCAATCTCC TTCACATCCC TCATCCTCAC TGAGCTACTG 
3121 ATGGTGGCGC TCACCATCCA GACGTGGCAC TGGCTCATGA CAGTGGCCGA GCTACTCAGC 
3181 CTGGCCTGCT ACATTGCCTC CCTGGTGTTC CTCCATGAGT TCATCGATGT CTACTTCATT 
3241 GCCACCCTGT CATTCCTCTG GAAGGTGTCC GTCATCACCT TGGTCAGCTG TCTCCCCCTC 
3301 TATGTCCTCA AGTACCTGCG GAGACGGTTC TCCCCACCCA GCTACTCGAA GCTCACTTCC 
3361 TAAGGTGCAG GGCTGCCTCG GGCAGGGCCT CCGGCCTCCG GCGCTWTCCC CAGGAGGAGG 
3421 TCAAGTTCCA CACGCACGAQ CCGCCTCTGC TGGACGGTGC AGTCATGGCT GGCACATGAG 
3481 GCTTCGCTGA GGCGACACTG GGCACCTAAT GGGGATQGAA CATTGGTGGA ACCGGAGGGA 
3541 GGGACCTGAG AGCTGTACCT ATCAGAACCT TGGGTQCTAA GCTGTGCTGA GGGGGAAGAC 
3601 GTGGGACCGQ ATGGCCCGTC TGAGGTTTGT GGGGTCACTG TGCAAGCTTC CCTTATGGTT 
3661 TGAACCTCTT GCCTGCAGCC CGGGG 



Figure 32B 



[Sheets 34/91 and 35/91: figures; no text is recoverable from the source.]



1 MTDSIPLQPV RHKKRVDSRP RAGCCEWLRC CGGGEPRPRT VWLGHPEKRD QRYPRNVINN 
61 QKYNFFTFLP GVLFSQFRYP FNFYFLLLAC SQFVPEMRLG' ALYTYWVPIjG FVLAVTIIRE 
121 AVEEIRCYVR DKEMNSQVYS RLTSRGTVKV KSSNIQVGDL ILVEKNQRVP ADMIFLRTSE 
181 KNGSCFLRTD QLDGETDWKL RLPVACTQRL PTAADLLQIR SYVYAEKPNI DIHNFLGTFT 
241 RENSDPPISE SLSIENTLWA GTVIASGTW GWLyTGRKCj RSVMNTSDPR SKIGLFDLEV 
301 NCLTKILFGA LVWSLVMVA LQHPAGRWYL QIIRFLLLFS NIIPISLRVN LDMGKIVYSW 
361 VIRRDSKIPG TWRSSTIPE QLGRISYLLT DKTGTLTQNE MVFKRLHLGT VAYGLDSMDE 
421 VQSHIFSIYT QQSQDPPAQK GPTVTTKVRR TMSSRVHEAV KAIALCHNVT PVYESNGVTD 
481 QAEAEKQPED SCRVYQASSP DEVALVQWTE SVGLTLVGRD QSSMQLRTPG DQVLNLTILQ 
541 VFPFTYESKR MGIIVRDEST GEITFYMKGA DWMAGIVQY NDWLEEECGN MAREGLRVLV 
601 VAKKSLTEEQ YQHFEARYVQ AKLSVHDRSL KVATVIESLE MEMBLLCIiTG VEDQLQADVR 
661 PTIiETLRNAG IKVWMLTGDK LETATCTAKN AHLVTRNQDI HVFRLVTNRG EAHLELNAPR 
721 RKHDCALVIS GDSLEVCLKY YEYEPMELAC QCPAWCCRC APTQKAQIVR LLQERTGKLT. 
781 CAVWDGGNDV SMIQESDCGV GVEGKEGKQA SLAADFSITQ FKHLGRLLMV HGRNSYKRSA 
841 ALSQFVIHRS LCISTMQAVF SSVFYFASVP LYQGFLIIGY STIYTMFPVF SLVLDKDVKS 
901 EVAMLYPELY KDLLKGRPLS YKTFLIWVLI SIYQGSTIMY GALLLFESEF VHIVAISFTS 
961 LIIiTELLMVA LTIQTWHWLM TVAELLSLAC YIASIjVFIiHE FIDVYFIATL SFLWKVSVIT 
1021 XiVSCLPLYVL KYLRRRFSPP SYSKLTS 

Figure 35 



1 MTDNIPIiQPV RQKKRMDSRP RAGCCEWLRC CGGGEARPRT VWLGHPEKRD QRYPRNVINN 
61 QKYNFFTFLP GVLFNQFKYF FNLYFLLLAC SQFVPEMRLG ALYTYWVPLG FVLAVTVIRE 

.121 AVEEIRCYVR DKEVNSQVYS RLTARGTVKV KSSNIQVGDL IIVEKNQRVP ADMIFLRTSE 
181 KNGSCFLRTD QLDGETDWKL RLPVACTQRL PTAADLLQIR SYVYAEEPNI DIHNFVGTFT 
241 REDSDPPISE SLSIENTLWA GTWASGTW GWLYTGREL RSVMNTSNPR SKIGLFDLEV 
301 NCLTKILFGA LVWSLVMVA LQHFAGRWYL QIIRFLLLFS NIIPISLRVN LDMGKIVYSW 
361 VIRRDSKIPG TWRSSTIPE QLGRISYLLT DKTGTLTQNE MIFKRLHLGT VAYGLDSMDE 
421 VQSHIFSIYT QQSQDPPAQK GPTLTTKVRR TMSSRVHEAV KAIALCHNVT PVYESNGVTD 
481 QAEAEKQYED SCRVYQASSP DEVALVQWTE SVGLTLVGRD QSSMQLRTPG DQILNFTILQ 
541 IFPFTYESKR MGIIVRDEST GEITFYMKGA DWMAGIVQY NDWLEEECGN MAREGLRVLV 
601 VAKKSLAEEQ YQDFEARYVQ AKLSVHDRSL KVATVIESLE MEMELLCLTG VEDQLQADVR 
661 PTLETLRNAG IKVWMLTGDK LETATCTAKN AHLVTRNQDI HVFRLVTNRG EAHLELNAFR 
721 RKHDCALVIS GDSLEVCLKY YEYEFMELAC QCPAWCCRC APTQKAQIVR LLQERTGKLT 
781 CAVGDGGNDV SMIQESDCGV GVEGKBGTOA SLAADFSITQ FKHLGRLLMV HGRNSYKRSA 
841 ALSQFVIHRS LCISTMQAVF SSVFYFASVP LYQGFLIIGY STIYTMFPVF SLVLDKDVKS 
901 EVAMLYPELY KDLLKGRPLS YKTFLIWVLI SIYQGSTIMY GALLLFESEF VHIVAISFTS 
961 LILTELLMVA LTIQTWHWLM TVAELLSLAC YIASLVFLHE FIDVYFIATL SFLWKVSVIT 

1021 LVSCLPLYVL KYLRRRFSPP SYSKLTS 



Figure 36



36/91 



wo 2004/061616 



PCT/US2003/041613 



ATGACGQACaUlCATCCaSCTGCAGCCGGTGCGCCaVG^ 
CGCGCCGGGTGCTGCGAGTOGCTGAGA^^ 

GTCTGGCTGGGGCACCCCGAGAAGAGAGACCAGAGGTATCCTCGGAATGTCATCAACAAT 
CAGAAGTACy^TTTCTTOVCCTlTCTTCCTGGGGTGCTGTTC^ 

TTCAACCTCTATTTCTTACTTCTTGCCTGCTCTCAGTTTGTTCCCGAAATGAGACra 

GCACTCTATACCTACTGGGTTCCCCTGGGCTTCGTGCTGGCCGTCACTGTCATCCGTGAG 

GCGGTGGAGGAGATCCGATGCTACGTGCGGGACAAGGAAGTCy^CTCCCAGGTCTACAGC 

CGGCTCACAGCACGAGGCACAGTGAAGGTGAAGAGTTCTAACATCCAAGTTGGAGACCTT 

ATCATCGTTGAAAAGAACCAGCGGGTCCCTGCCGACATGATCTTCCTGAGGACATCAGAA 

AAAAACGGGTCATGCTTCTTGCGGACGGATCAGCTGGATGGGGAGACGGACTGG3AGCTC 

CGGCTTCCCGTGGCCTGCy^CGCAGAGGCTCCCaVCGGCCGCCGACCTTCTTCAaATTCQA 

TCGTATGTGTAOSa^GAAGAGCCAAATATTQACATTCACAACTTCGTC 

CQAGAAOACAGCGACCCCCCGATCAGaSAGAGCCTGAGCy^TAQAGAACAaSCTGTC 

GGO^CTGTGGTCGCATaVGGTACTGTTGTGGGTQTTQTTCTTTACACrG 

OSGAGTGTCATGAATACCTCaVAATCCCCGAAGTAAGATCXSQCCTGTTCGACTTGGAAGTG 

AACTGCCTCACCAAGATCCTCTTTGGTGCCCTGGTGGTGGTCTCGCTGGTC^ 

CTTCAGCyVCTTTGCAGQCCGTTGGTACCTGCAGATCATCCGCTTCCTCCTCrCTC 

AACa.TCATCCCCATTAGTTTGCGTGTGAACCTGGACATGGGCA?lGATCGTGTACAGC 

GTGATTCGAAGGGACTCGAAAATCCCCGGGACCGTGGTTCGCTCCAGCACGATTCCTGAG 

CAGCTGGGCAGGATTTCGTACTTACTCACAGACAAGACAGGCACTCTTACCCAGAACGA 

ATGATTTTCAAACGGCTCCATCTCGGAACAGTAGCCTACGGCCTCGACTCAATGGACGAA 

GTACAAAGCCACATTTTCAGCATTTACACCCAGCAATCCCAGGACCCACCGGCTCAGAAG 

GGCCCAACGCTCACCACTAAGGTCCGGCGGACCATGAGCAGCCGCGTGCACGAAGCCGTG 

AAGGCCATCGCGCTCTGCCACAACGTGACTCCCGTGTATGAGTCCaUVCGGTGTGACTC 

CAGGCnXSAGGCOSAGAAGCAGTACGAAGACTCCTGCCGCGTATACCAGGa 

GATGAGGTGGCCCTGGTACy^GTGQACGGAAAGTGTGGGCTTAACCCTGGTGGGCCGAGAC 
C^TCT^CCATGCAGCTG 

ATCOTCCCTTTCACCTATGAAAGCAAACGTATGGGO^TCATCGTC^ 
GGAGAAATTACGTTTTACATGAAGGGAGCAGATGTGGTCaVTGGCTGGC^ 

AATGACTGGTTGGAGGAAGAGTGTGGCAACATGGCCCGAGAAGGGCTGCGGGTGCTCGTG 
GTGGCAAAGAAGTCTCTTGCAGAGGAGCAGTATCAGGACTTT^^ 

GCCAAGCTGAGTGTGCACGACCGCTCCCTCAAAGTGGCCACGGTGATCGAGAGCCTGGAG 

ATGGAGATGGAACTGCTGTGCCTGACGGGCGTGGAGGACCAGCTGCAGGCAGATGTGCGG 

CCCACGCTGGAGACCCTGAGGAATGCTGGCyiTCAAGGTTTGGATGCTGACAGGGGACAAG 

CTGGAGACAGCTACGTGCACAGCGAAGAATGCACATCTGGTGACCAGAAACCAAGA 

CACGTTTTTCGGCTGGTGACCAACCGCGGGGAGGCTCACCTCGAGCTGAACGCCTTCCGC 

AGGAAGCATGATTGTGCCCTGGTCATCTCGGGAGACTCCCTGGAGGTTTGCCTC^^ 

TATGAGTACGAGTTCATGGAGCTGGCCTGCCAGTGCCCGGCCGTAGTCTGCTGCCGATGT 

GCCCCCACCCAGAAGGCCCAGATCGTGCGCCTGCTTCAGGAGCGavCGGGC^^ 

TGTGCAGTAGGGGACGGAGGCAATQACGTCAGCATGATTCAGGAATCTmC^ 

GGAGTGGAAGGAAAGGAAGGAAAAGAGGCTTCGTTGGCTGCAGACTTC 

TTTAAGGATCTTGGCOMTTGCTTATGGTGCATGGCa^^ 

GCCCrCAGCaVGTKXSTGATTCAa^GGAGCCTCTGTATCAGC^ 

TCCTCCGTGTTTTACTTTGCCTCOSTCCCTCTC^ 

TCCACa^TTTACACCATGTTTCCTGTGTTTTCTCTGGTCCTGGACAAAGATGTC^^ 
GAAGTTGCCATGCTGTATCCTGAGCTCTACAAGGATCTTCTC7UVGGGACGGCCGTTGTCC 
TACAAGACATTCTTAATATGGGTTTTGATTAGCATCTATCAAGGGAG^ 
GGGGCGCTGCTGCTGTTTGAGTCGGAGTTCGTGCACATCGTGGCCATCTCCTl^ • 
CTGATCCTCACCGAGCTGCTCATGGTGGCGCTGACCATCCAGACCTGGCACTGGCTCATG 
ACAGTGGCGGAGCTGCTCAGCCTGGCCTGCTACATCGCCTCCCTGGTGTTCTTACACGAG 
TTCATCGATGTGTACTT^CATCGCCACCTTGTCATTCTTGTGGAAAGTCTCCGT 
CnX3GTCAGCTGCCTCCCCCTCTATGTCCTCAAGTACCTGCX3AAGACGGTT 
AGCTACTCAAAGCTCACATCA 



Figure 37 



37/91 



wo 2004/061616 



PCTAJS2003/041613 



1 GGGAAGCTGT TGCGCACCAC TTAGCTGGGA AGTGCGTTGC TCCCTGTTTC CCAGCCCACC 
61 CGAGATGGCC CCCAAAGTCT CGGATTCCGT GGAACAGCTC CGCGCTGCCG GCAACCAGAA 
121 CTTCCGCAAT GGCCAGTACG GCGAAGCTCG GCGCTGTACG AGCGCGCACT GCGGCTGCTG 
181 CAGGCGCGAG GTAGGAACCC GCCCCACGTT TCCCTCCGGG CCTGCGTCCT CCACCCGCa.T 
241 CCCCGCACCG GGCCTCCCGT TGGCCCAGCC TCCCTGGTTT TCCCTTCCCC GCGTCCAGCC • 
301 GCCGCACCAG GCCCTCCCAG GGCTTGACCC CGCGATTCTT TCCGTCCCTG GCCX3CCTAGC 
361 CGCGGCCCGG TCTACCATCA CCACCCCCCA CCACCCCCAG. GCCAGTCGGC TGCGGGCCTT 
421 AAGGGCACGC ATCCGCTGCT TCCACCCGGA AGCTGTTGCG CACTCCTCGG CGGG6AACGG 
481 AGGTGGTCCT TGTTTGCCGG CCTCCCGGGA TGGCCCCCAA ACTCTCAGAC TCTGTGGAAG 
541 AGCTCCGCGC AGCCX3GCAAC CAGAGTTTCC GCAACGGACA GTACGCCGAG GCTTCGGCGC 
601 TGTACGAGCG CGCGCTGCGA CTGCTGCAGG CGCGAGGTTC TGCAGACCCC GAAGAAGAAA 
661 GTGTTCTGTA CTCCAACCGT GCAGCGTGCT ACTTGAAGGA TGGGAACTGC ACAGATTGCA 
721 TCAAAGATTG CACTTCCGCG CTGGCCTTGG TTCCCTTCAG CATCAAGCCC TTGCTGCGCA 
781 GAGCATCTGC ATATGAAGCC CTGGAGAAGT ACGCCCTGGC CTACGTTGAC TATAAGACTG 
841 TGCTGCAGAT CGATAACAGT GTGGCATCCG CCCTGGAAGG CATCAACAGA ATAACCAGAG 
901 CTCTCATGGA CTCCCTGGGA CCTGAGTGGC GCCTGAAGCT GCCCCCTATC CCTGTGGTGC 
961 CTGTTTCAGC CCAGAAGAGA TGGAATTCCT TGCCTTCAGA TAACCACAAA GAGACAGCTA 
1021 AAACCAAATC CAAAGAAGCC ACAGCTACGA AGAGCAGAGT GCCTTCTGCT GGGGATGTCG 
1081 AGAGAGCCAA AGCTCTGAAG GAAGTUIGGCA ATGACCTTGT AAAGAAGGGC AACCATAAGA 
1141 AAGCTATTGA GAAGTACAGT GAGAGCCTCT TGTGTAGTAG CCTGGAGTCT GCCACATACA 
1201 GCAACAGAGC GCTCTGTCAC CTGOTCCTGA AGCAGTACAA GGAGGCAGTA AAGGACTGCA 
1261 CAGAAGCCCT CAAQCTGGAT GGGAAGAATG TAAAGGCGTT TTACAGACGG GCTCAAGCCT 
1321 ACAAGGCACT CAAGGACTAT AAGTCAAQCC TTTCGGATAT CAGCAGCCTC CTACAAATTG 
1381 AACCCAGGAA TGGCCCTQCA CAGAAGTTAC GGCAGGAAGT TAACCAGAAC ATGAACTAAA 
1441 CCGTAGAGGG CAACAGGGAC CCTGAACTTG ACCTTCCCAG AGAAGCCAGG GCCTCCCTTG 
1501 CATCTGCCCC AATGCCCAGC ATGCCGCCAA GGGAGTGCAA AATCAACCCC ACTTTGACTC 
1561 CTTGiSAGAGG TAGCAGCCTT TCACCTGACA CATTTTACTT GTTCAGATTA AGTCCATTAC 
1621 AGACAAGCAC AGGACTCTTT TTTTTTTTCT TCTTTTTTTT TTCCAGAAAG GTCCCCACTA 
1681 GAGGTTTTTG TTTTGTTTTA TTTTTAATTT AAAAAAGCGT GACGCCAACA GCCCTGGCCT 
1741 CATTCGCTTG CTTCTGCCTG GCCCTTGTCA ACACAGTCCT TGGCAACTGT CCCTGACCCA 
1801 GATATGCACA GACTGGGTGC CTGTGACTTC CTCTGCCGCC ATAGCTCTGC AGTTCACCTG 
1861 AGTGCTGACA GGCTAGAAGT GCTTGCTCGT CCGCAGCCAC AGCGGCCTGT TGAGCTGGTT 
1921 CTCCAAGGCT GCCTGCCATC TCCTCGAGGA GACAGCTGCT GTCTGCACCC TGTCCTTGAC- 
1981 ACAGTGTCCT GTGTTQAGCC CCAGTGCCTT TAGTCCAGGC CCTTTGTGGG AAGGCAGAGC 
2041 CTAACCCTTG GAGGCTCTGT GTTGTTGCCT TCTGTCTGAG CTACCTACGA TGTTCAAAGA 
2101 GCCCAGATTG CTCCTGCAAT GGGGAGAGAG GCCTCCTTOA GATTAGTGTC CCTCCAGTCT 
2161 GA GCAG GAAC TTAACCTTTT CCCCCATAGC AGCAGCCCCT CGGGCTCCTT TGTTTTGTTT 
2221 TGTTTTGTTA ATATGTTGGA GTTAATTGAA CTGATTTTAT TGAAGTGTGT GTTGCTGTTG 
2281 CATTAA7VAGG TTTTCTTCTA TG 



Figure 38 



38/91 



wo 2004/061616 



PCT/US2003/041613 



\f CO 




39/91 



wo 2004/061616 



PCT/US2003/041613 






40/91 



wo 2004/061616 



PCT/US2003/041613 



MAPKLSDSVE ELRAAOTQSP iasrCQYAEASA 
YLkbONCTDC IKDCTSALAL VPFSIKPLLR 
ALEGINRITR ALMDSLGPEW RLKLPPIPW 
KSRVPSAGDV ERAKALKEEG NDLVKKGNHK 
KQYKBAVKDC TEALKIiDGKN VKAFYRRAQA 
RQEVNQNMN 



LYERAIiRliLQ ARGSADPEEE SVLYSNRAAC 
RASAYEALiEK ' YALAYVDYKT VLQIDNSVT^ 
PVSAQKRWNS LPSDNHKETA KTKSKEATAT 
KAIEKYSESL LCSSLESATY SNRALCHLVL 
YKALKDYKSS LSDISSLLQI EPRNGPAQKL 



Figure 40 



MAPKFPDSVEELRAAGNESFRNGQYAEASALYGRAIiRVIiQAQGSSDPEEESVLYSI^^ 

IKDCTSALALVPFSlKPIJ^RRASAYEAIiEKYPMAYVDYKTVLQIDDN^ 

Rl^PSIPLVPVSAQKRWNSLPSEiraKEMAKSKSKETTATKNRVPSAGDVEKI^ 

KAIEKySESLLCSNLESATYSNRALCYLVLKQYTEAVKDCTEALKLDGKNVK^ 

FADISNLLQIEPRNGPAQKLRQEVKQNLH 

Figure 41 

i GGCACGAGGC ACCACACGGG GGAGGAAGGA AGGAGCTCCC AACTCGCCGG CCTGGCCACG 
61 GGATGGCCCC CAAATTCCCA GACTCTGTGG AGGAGCTCCG CGCCGCCGGC AATGAGAGTT 
121 TCCGGAACGG CCAGTACGCC GAGGCCTCCG CGCTCTACGG CCGCGCGCTG CGGGTGCTGC 
181 AGGCGCAAGG TTCTTCAGAC CCAGAAGAAG AAAGTGTTCT CTACTCCAAC CGAGCAGCAT 
241 GTCACTTGAA GGATGGAAAC TGCAGAGACT GCATCAAAGA TTGCACTTCA GCACTGGCCT 
301 TGGTTCCCTT CAGCATTAAG CCCCTGCTGC GGCGAGCATC TGCTTATGAG GCTCTGGAGA 
361 AGTACCCTAT GGCCTATGTT GACTATAAGA CTGTGCTGCA GATTGATGAT AATGTGACGT 
421 CAGCCGTAGA AGGCATCAAC AGAATGACCA GAGCTCTCAT GGACTCGCTT GGGCCTGAGT 
481 GGCGCCTGAA GCTGCCCTCA ATCCCCTTGQ TGCCTGTTTC AGCTCAGAAG AGGTGGAATT 
541 CCTTGCCTTC GGAGAACCAC AAAGAGATGO CTAAAAGCAA ATCCAAAGAA ACCACAGCTA 
601 CAAAGAACM AGTGCCTTCT GCTGGGGATG TGGAGAAAGC CAGAGTTCTG AAGGAAGAAG 
661 GCAATGAGCT TGTAAAGAAG GGAAACCATA AGAAAGCTAT TGAGAAGTAC AGTGAAAGCG 
721 TCTTGTGTAG TAACCTGGAA TCTGCCACGT ACASCAACAG AGCACTCTGC TATTTGGTCC 
781 TGAAGCAGTA CACAGAAGCA GTGAAGGACT GCACAGAAGC CCTCAAGCTG GATGGAAAGA 
841 ACGTGAAGGC ATTCTACAGA CGGGCTCAAG CCCACT^GC ACTCAAGGAC TATAAATCCA 
901 GCTTTGCAGA CATCAGCAAC CTCCTACAGA TTGAGCCTAG GAATGGTCCT GCACAGAAGT 
961 TGCGGCAGGA AGTGAAGCAG AACCTACACT AAAAACCCAA CAGGGCAACT GGAACCCCTG 
1021 CCTGACCTTA CCCAGAGAAG CCATGGGCCA CCTGCTCTGT GCCCGCTCCT GAAACCCAGC 
1081 ATGCCCCAAG TGAGCTCTGA AGCCCCCTCC TCAATCCCTT GATGGCCTCC CACCCTGTAA 
1141 GAGGCTTTGC TTGTTCAAAT TAAACTCAGT GTAGTCAAAC ACAGACATGG TTGTTGCACC 
1201 AGAAAGGTCC CCACTAGAGC TAAGCGTGAA GCTGAAGCTC TGTCCCTATT CCCCCAGCCC 
1261 AGCTAGCTGA TCACACCAAC AGATCCTCAT CAGCAAAGCA TTTGGCTTTG TCCTGCCCAA 
1321 GTGGGCTGCA GACTGAGTGC TGCCCTTGTA GCTTCCCCAG ACCCCAACTC ACTGCAGTTC 
1381 ATCTGAACAA CCTGAGCTCC TGGGCCGGGG TGGAAGGAGG GGGATAAACC TAAGGCCCTQ 
1441 ATCCAAAGCA GGCTGTT6AG CTGGTTCTCC AGGGCTGCAG TCTCTCCAGG TGTACAGCTO 
1501 CTGTCCCTGC CCTGTCCTGT CCTTGCACAQ TCTCCTATGT CTGAGCCCCA GTGCCTTCTG 
1561 TTCGGGCCCT CCTTTGGTGG GAAGGCAGAQ CCCTGACCCT T6AATGGTTG TCCTTGACTC 
1621 TGTGCTGCTG CCTTCTGCAG AGAGGCACCT AAGCTGTTTA AAGAGCCCAG TGATTGTGGC 
1681 TGCTCCTCCT AGAGGTGGGA GGGGGCAAGA GGCCTCCTTG GTCAGTGTCC ATGCTTTCTG 
1741 GGCAGGGACT TGGTTTTTTG TTCCAACAGT GGCCTTCTCC GGGCTTCATA GTTCTTTGTA 
1801 ATATGTTGAA GTTAATTTGA ATTGACTGAT TTTGTTGAAC TGTGTGTTTA AGCTGTTGCA 
1861 TTT^AAAAGCT TTCTTCTACA TGAATATCTG CTGTGCTTTC ATTTATGCCT TTTCAGCTTT 
1921 GCACCTGGAA CTCTGTAGTA ATAATAAAAG TTATTGCTTA TTGGGCATTC AAAAAAAAAA 
1981 AAAAAAAA 

Figure 42



41/91 



wo 2004/061616 



PCT/US2003/041613 



MAPKVSDSVE QLRAAG^QNF. BNGQYGEASA XiYERALRXiXiQ AR6SADPBB& SVLYSNRAAC. 
YLKBGNCTDC IKDCTSALAL VPFSIBCPLIiR RASAYEALEK YMiAYVDYKT VLQIDNSVAS 
AIiEGINRITR ALMDSIiGPEW RLKLPPIPW PVSAQKRWNS LPSDNHKETA KTKSKEATAT 
KSRVPSAGDV ERAKALKEEG NDLVKKGNHK KAIEKYSESL LCSSLESATY SNRALCHLVL 
KQYKEAVKDC TEALKLDGKN VKAFYRRAQA YKALKDYKSS LSDISSLLQI BPRNGPAQKIi 
RQEVNQNMN 

Figure 43 



MAPKFPDSVE ELRAAGNESP RNGQYAEASA LYGRALRVLQ AQGSSDPEEE SVLYSNRAAC 
HWKNGNCM)C IKDCTSALAL VPFSIKPLLR RASAYEALEK YPMAYVDYKT VLQIDDNVTS 
AVBGINR^TTR ALMDSLGPEW RXiKLPSFPLV PVSAQKRWNF LPSE13HKEMA KSKSKETTAT 
KNRVPSAGDV EKARVLKEEG NELVKKGNHK KAIEKYSESIi LCSNLBSATY SNRALCYIiVL 
KQYTEAVKDC TEALKLDGKN VKAFYRRAQA HKALKDYKSS PADISNLLQI EPRNGPAQKL 
RQEVKQmiH 



Figure 44 



42/91 



wo 2004/061616 



PCTAJS2003/041613 



1 GACTGGCTGQ TGCGGGAAAT ATGCAGGAGA AAAGTCTTTG CATAATGTAG AGCGAGCCGT 
61 GGGGCTCCGG GAGCGGCGCC CCAAGGTCTG GGGCCATGAA CGCGAGCGTG GAAGGAGACA 
121 CCTTTTCTGG ATCGATGCAA ATCCCAGGAG GCACCACGGT CGTGGTGGAG CTGGCACCGG 
181 ACATCCACAT CTGCGQCCTC TGTAAGCAGC ACTTCAGCAA TCTGGATGCC TTTGTGGCCC 
241 ACAAACAGAG CGGCTGCCAG CTGACTACCA CGCCGGTGAC AGCCCCCAGC ACGGTCCAGT 
301 TTGTGGCAGA GGAGACAQAG CCTGCCACCC AGACCACCAC AACGACCATC AGTTCAGAGA 
361 CTCAfiACTAT CACAGTTTCA GCTCCAGAGT TCGTCTTTGA ACATGGCTAC CAAACTTACC 
421 TQCCCACGGA GAGCACTGAC AACCAGACAG CCACCGTGAT CTCTCTCCCC ACCAAGTCAC 
481 GCACCAAAAA GCCCACAGCA CCCCCTQCTC AGAAGAGACT CGGCTGCTGC TATCCAGGTT 
541 QCCAGTTCAA GACCQCCTAT GGCATGAAGG ACATGGAGCQ ACACCTGAAG ATCCACACCQ 
601 GTGACAAACC CCACAAGTGT GAGGTGTGCG GGAAGTGCTT CAGCCGGAAG GACAAGCTGA 
661 AGACGCACAT GCGCTGCCAC ACGGGCGTCA AGCCCTACAA GTGCAAGACG TGCGACTACG 
721 CGGCGGCGGA CAGCAGCAGC CTTAACAAGC ACCTGCGCAT CCACTCGGAC GAGCGACCCT 
781 TCAAGTGCCA GATCTGTCCC TACGCCAGCC GCAACTCCAG CCAGCTCACC GTGCACCTGC 
841 GCTCGCACAC GGGGGACGCC CCCTTCCAGT GCTGGCTCTG TAGTGCCAAG TTCAAAATCA 
901 GCTCGGACTT GAAAAGGCAC ATGCGTGTGC ACTCGGGGGA GAAGCCTTTC AAGTGCGAAT 
961 TCTGCAATGT CCGCTGTACC ATGAAGGGGA ACCTCAAATC GCACATCCGC ATCAAGCACA 
1021 GTGGGAATAA CTTCAAGTGT CCGCACTGCG ACTTCCTGGG TGACAGCAAA TCCACCCTGC 
1081 QGAAGCACAG TCGCCTGCAC CAGTCGGAGC ACCCGGAGAA GTGTCCCGAG TGCAGCTACT 
1141 CCTGTTCCAG CAAGGCCGCG CTGCGCGTGC ACGAGCGCAT CCACTGCACC GAGCGCCCGT 
1201 TCAAGTGCAG CTACTGCAGC TTCGATACCA AGCAACCCAG CAACCTGAGC AAGCACATGA 
1261 AGAMTTCCA CGCC6ACATG CTCAAQAACG AGGCTCC!GGA GAAGAAGGAG AGCGGCAGGC 
1321 AGAGCAGCCG GCAGGTGGCC AGGCTGGATG CCAAQAAGAC GTTCCACTGC GACATCTGTG 
1381 ACGCCTCGTT TATGCGGGAG GACTCGCTCC GCAGCCACAA ACGGCAGCAC AGTGAGTACC 
1441 ACAGTAAGAA CTCGGACGTG ACTGTAGTAC AGCTTCACCT TGAACCCAGC AAGCAGCCGC 
1501 TGCGCCCCTC ACCGTAGAGC AAATCCAGGT CCCCCTCCAG TCCAGCCAGG TGCCCCAGTT 
1561 CAGCGAGGGG AGGGTCAAGA TCATCGTGGG GCATTACAGG TGCCTCAGAC GAACCGCCAT 
1621 AGTCCAAGCG GCCGCAGCTG CCGTCAACAT TGTGCCCCCC ACCCTGGTAG CCCAGACCCC 
1681 AGAGGAGATC CCAGGC3AACG GCCGGCTACA GATCCTTCGC CAGGTCAGTC TCATTGCCCC 
1741 TCCTCAGTCC TCCGGGTGTC CCGGCGAAGC AGGTGCCCTG AGTCAGCCAA CTGTCCTGCT 
1801 GACCACCCAT GATCAGACGG CAGGGGCCGC CCTGCAGCAG GCTCTGATCC CCACCACCCC 
1861 GGTTGGGACC CAGGAAGGCA CGGGAAACCA GACATTCATT GCCAGTTCGG GCATCGTGCT 
1921 CGGACTTGGA AGGCCTTAAG CTCTATTCAG GAGGGAACGA CX3GAAGTGAC TGTGGTGAGC 
1981 GATGGGGACC AGAGCATCGC AGTGGCCACC ACGGCACCCT CTATCTTCTC TACCCAGCAG 
2041 GAACTGCCCA AGCAGACTTA CTCCATCATC CACGGGGCGG CACACCCCGC CCTGCTCTGT 
2101 CCC6CCGACT CCATTCCTGA TTAGTCTQGA GGGAGGGGTG ACAGACAAGA CAAACTGCGA 
2161 GAGQAGTACT GTGAGAGGCT CCTGGTCCCG CATAAATAAT TGTATTTTAT ACAGTTTATG 
2221 TAATTTTTTA ACAGGGTATC AAGCTGGAGA CCATTCTCCC TCAAGCTCTT GTTGATTGTG 
2281 TCTTAATGGT TACCAAGGCT GATTCCAATG TGQAGTTGGA ATTCACCACA GTAGGACTGA 
2341 ATACATTCGT TTGTTTTTCC ATGTTTAGGA TTTAATTTTT TTCAACTGGA ATAAAGGA6T 
2401 TTGGGATTTG" GGTTAAAAAA 

Figure 45 



43/91 



wo 2004/061616 



PCT/US2003/041613 



1 GAGTCCTCCC CGCCTCGCAG AGTTGGGAGA AGGCAGGGTQ GGGGGTGTGG AAAAATAAAA 
61 GGAAAAGTCC TTGCACCATG TAGATCAGCG TCCCCCACTT TGGCATCCCG GCCGGCCGGG 
121 GACCTCCCAG TCTGCGGCCA TGAACGCGAG CAGCGAGGGC GAGAGCTTCG CGGGCTCGGT 
181 GCAAATTCCA GGTGGCACAA CGGTGCTGGT GGAGCTGACT CCCGACATCC ATATCTGCGG 
241 CATCTGCAAG CAGCAGTTTA ACAACCTGGA TGCCTTTGTA GCTCACAAGC AAAGTGGCTG 
301 CCAGCTGACA GGCACATCCG CAGCAGCCCC CAGCACGGTC CAGTTTGTAT CGGAGGAAAC 
361 AGTGCCTGCC ACCCAGACTC AGACCACCAC CAGAACCATC ACCTCGGAGA CCCAGACAAT 
421 CACAGTTTCA GCTCCAGAAT TTGTTTTTGA ACATGGCTAT CAAACTTACC TGCCCACGGA 
481 AAGTAATGAA AACCAGACAG CCACTGTCAT CTCTCTCCCT GCCAAGTCAC GCACCAAAAA 
541 GCCCACAACA CCACCTGCTC AGAAAAGGCT TAACTGTTGC TATCCAGGTT GCCAATTCAA 
601 GACTGCTTAT GGCATGAAGG ACATGGAGCG GCATTTAAAA ATTCACACGG GAGACAAACC 
661 CCATAAGTGT GAAGTCTGTG GCAAGTGCTT TAGCCGGAAA GACAAGCTGA AAACTCACAT 
721 GCGGTGCCAC ACGGGCGTGA AGCCCTACAA GTGTAAGACG TGTGACTACG CCGCTGCCGA 
781 CAGCAGCAGC CTCAACAAGC ACCTGAGGAX CCACTCGGAC GAGCGGCCCT TCAAATGCCA 
841 GATCTGCCCC TACGCCAGCC GCAACTCCAG CCAGCTCACT GTCCACCTGC GATCCCACAC 
901 GGGGGACGCC CCCTTCCAGT GCTGGCTCTG TAGCGCCAAQ TTCAAAATCA GCTCGGACTT 
961 GAAAAGGCAC ATGCGGGTGC ACTCGGGGGA GAAGCCTTTC AAGTGCGAGT TCTGCAATGT 
1021 CCGCTGCACC ATGAAGGGGA ACCTCAAGTC GCACATCCGT ATCAAGCAGA GCGGGAATAA 
1081 CTTCAAGTGT CCTCATTGCG ACTTCCTGGG TGACAGCAAA GCCACCCTCC GGAAGCACAG 
1141 CCGCGTGCAC CAGTCGGAGC ATCCTGAGAA GTGCTCGGAA TGCAGCTACT CCTGCTCCAG 
1201 CAAGGCCGCC. CTGCGCATCC ACGAGCGTAT CCACTGCACC GACCGCCCTT TCAAGTGCAA 
1261 CTACTGCAGC TTCGACACCA AACAGCCCAG CAACCTGAGC AAGCACATGA AGAAGTTCCA 
1321 CGGGGACATG GTTAAGACTG AGGCTCTAGA GAGGAAGGAC ACCGGCAGGC AGAGCAGCCG 
1381 GCAGGTGGCC AAGCTGGATG CCAAGAAGAG TTTCCACTGC GATATATGCG ATGCCTCCTT 
1441 CATGCGGGAG GACTCGCTCC GCAGCCACAA GAGACAGCAC AGTGAGTACA ATGAGAGTAA 
1501 GAACTCGGAC GTGACCGTTC TCCAGTTTCA GATCGACCCC AGCAAGCAGC CCGCCACGCC 
1561 CGTCACTGTG GGACACCTCC AGGTGCCCCT CCAGCCCAGC CAAGTGCCCC AGTTCAGCGA 
1621 GGGAAGAGTC AAAATCATCG TTGGGCATCA GGTGCCCCAG GCGAACACCA TCGTCCAGGC 
1681 TGCCGCTGCT GCAGTGAACA TCGTCCCGCC TGCCTTGGTG GCCCAGAACC CAGAGGAACT 
1741 CCCAGGGAAC AGCCGGCTGC AGATCCTGCG CCAGGTCAGT CTGATCGCCC CCCCTCAGTC 
1801 CTCGCGGTGT CCGAGCGAGQ CGGGCQCAAT GACCCAGCCG GCTGTCCTQC TGACCACCCA 
1861 CGAGCAGACG GACGGAGCCA CTCTGCAOCA GACTCTCATC CCCACGGCCT CAGGTGGCCC 
1921 CCAGQAAGGC TCTGGCAATC AAACTTTCAT TACCAGTTCG GGTATTACTT GCACTGACTT 
1981 TGAAGGCCTA AACGCCTTGA TTCAGGAGGG GACAGCAGAA GTGACAGTGG TGAGCGATGG 
2041 AGGCCAGAAC ATCGCAGTGG CCACCACAGC GCCACCGGTC TTCTCCTCCT CTTCCCAGCA 
2101 AGAACTACCC AAGCAGACCT ACTCCATCAT TCAAGGGGCA GCCCATCCAG CTTTGCTCTG 
2161 TCCCGCCGAC TCCATTCCAG ATTAGTGCTT AAAAAACAAA AGGAGTGGGG GAAAGGAATT 
2221 GAGAAAAAGA AATCTTAAGT AGAATTCTCT AAAAGGTTTG CTCTTAATGT TTTCTTTGTT 
2281 TTGTTTTGTT TTTGAGACGG AGTCTCGCTC TGTTTCCCAG GCTGGAGTGC AGTGGCGCTA 
2341 TCTTGGCTCA CTGCAACGTC CGCCTCCCAG GTTCAAGCGA TTCTCATGCC TCGGCCCTCC 
2401 GAGTAGCTGG GACCACAGGT GTACGACATC ATGACTGGCT AATTTTTGTA TATTTAATAG 
2461 AGGCGGGGTT TCATCATGTT GAACTCCTGA CCTCAAGTGA TCTGCCCACC TCAGCCTCCC 
2521 AAAGTGCTGG GATTACAGGT GTGAGCCACC ATGCCTGGCC GTGGTTTGCT CTTAATGTTT 
2581 TTAAGGATGG TTGTGAATCC CCCTGGCCCC ATAATAAATT GTAATTTTAT ACTGCTTACT 
2641 ATAATTTTTT TAACACTGTA ACAACTTTGA GACCACCTCT GAATCGTCGC ATTATAACTG 
2701 TTGTAGAATC TTAAATGTTA CCAAGATQAT TCCAATGAGG GGTT6GAATT AAATGCATTA 
2761 AGTAGTGAAC TCATGTGTTT GTTTCCAACT TGATTTTCCA ACTCTAATAA AGGTTTCTGT 
2821 CCATCTTATT ACATTTGTGT AGTAAATQGT ACTTCCCAGC CTCTCTTTTG CCCCATTCTG 
2881 GAATACTCCC CAGAGTTTGG GGGTGTTCAT GTTTTATACA TGTAAGTCTG TTGGCATGAA 
2941 GGACCATTTT CTACATAATA TGACATGGAT ACTTGACCCA AAAAAAATGT TTAGTGCTAA 
3001 TGAGCAGAAA ATGAATGGTT CCATAATAAA TTGATATCTG ATTAAAAT 



Figure 46 



44/91 



wo 2004/061616 



PCTAUS2003/041613 



MNASVEGDTP SGSMQIPGGT TWVEIiAPDI HICGLCKQHF SNLDAFVAHK QSGCQLTTTP 
VTAPSTVQFV AEETEPATQT TTTTISSETQ TITVSAPEPV FEHGYQTYLP TESTDNQTAT 
VISLPTKSRT KKPTAPPAQK RLGCCYPGCQ FKTAYGMKDM ERHLKIHTGD KPHKCEVCGK 
CFSRKDKLKT HMRCHTGVKP YKCKTCDYAA ADSSSLNKHL RIHSDERPPK CQICPYASRN 
SSQLTVHLRS HTGDAPFQCW LCSAKFKISS DLKRHMRVHS GEKPFKCEPC NVRCTMKGNL 
KSHIRIKHSG NNFKCPHCDF LGDSKSTLRK HSRLHQSEHP EKCPECSYSC SSKAALRVHE 
RIHCTERPFK CSYCSFDTKQ PSNLSKHMKK FHADMLKNEA PEKKESGRQS SRQVARLDAK 
KTFHGDICDA SFMREDSLRS HKRQHSEYHS KNSDVTWQL HLEPSKQPLR PSP 

Figure 47 

MNASVEGDTF SGSMQIPGGT TWVEIiAPDI HICGLCKQHF SNLDAFVAHK QSGCQLTTTP 
VTAPSTVQFV AEETEPATQT TTTTISSETQ TITVSAPEFV FEHGYQTYLP TESTDNQTAT 
VISLPTKSRT KKPTAPPAQK RLGCCYPGCQ FKTAYGMKDM ERHLKIHTGD KPHKCEVCGK 
CFSRKDKLKT HMRCHTGVKP YKCKTCDYAA ADSSSLNKHL RIHSDERPFK CQICPYASRN 
SSQLTVHLRS HTGDAPFQCW LCSAKFKISS DLKRHMRVHS GEKPFKCEPC NVRC1MKGNL 
KSHIRIKHSG NNFKCPHCDF LGDSKSTLRK HSRLHQSEHP EKCPECSYSC SSKAALRVHE 
RIHCTERPFK CSYCSPDTKQ PSNLSKHMKK FHADMLKNEA PEKKESGRQS SRQVARLDAK 
KTPHCDICDA SFMREDSLRS HKRQHSEYHS KNSDVTWQL HLEPSKQPLR PSP 

Figure 48 

MNASVEGDTP SGSMQIPGGT TVLVBLMDI HICGLCKQHF SNLDAFVAHK QSGCQLTTTP 
VTAPSTVQFV AEETEPATQT TTTTISSETQ TITVSAPEPV FEHGYQTYLP TESTDNQTAT 
VISLPTKSRT KKPTAPPAQK RLGCCYPGCQ FKTAYGMKDM ERHLKIHTGD KPHKCEVCGK 
CFSRKDKLKT HMRCHTGVBCP YKCKTCDYAA ADSSSLNKHL RIHSDERPPK CQICPYASRN 
SSQ LTVHLR S HTASVLENDV QKPAGLPAEE SDAQQAPAVT LSLEAKERTA TLGERTFNCR 
YPGCHFKTVH GMKDLDRHLR IHTQDKPHKC EPCDKCFSRK DNLTMHMRCH TSVKPHKCHL 
CDYAAVDSSS LKKHLRIHSD ERPYKCQLCP YASRNSSQLT VHLRSHTGDT PFQCWLCSAK 
FKISSDLKRH MIVHSGEKPF KCEFCDVRCT MKANLKSHIR IKHTFKCLHC AFQGRDRADL 
LEHSRLHQAD HPEKCPECSY SCSNPAALRV HSRVHCTDRP FKCDFCSFDT KRPSSLAKHI 
DKVHREGAKT ENRAPPGKDG PGESGPHHVP NVSTQRAFGC DKCGASPVRD DSLRCHRKQH 
SDWGBNKNSN LVTFPSEGIA TGQLGPLVSV 6QLESTLEPS HDL 

Figure 49 

MNASVEGDTP SGSMQIPGGT TVLVELAPDI HICGLCKQHF SNLDAFVAHK QSGCQLTTTP 
VTAPSTVQFV AEETEPATQT TTTTISSETQ TITVSAPEFV FEHGYQTYLP TESTDNQTAT 
VISLPTKSRT KKPTAPPAQK RLGCCYPGCQ FKTAYGMKDM ERHLKIHTGD KPHKCEVCGK 
CFSRKDKLKT HMRCHTGVKP YKCKTCDYAA ADSSSLNKHL RIHSDERPFK CQICPYASRN 
SSQLTVHLRS HTAWRCDCLG STKPWVPSLV TT 

Figure 50 



45/91 



wo 2004/061616 



PCTAJS2003/O41613 



MmSSEQBSF AGSVQIPGGT TVLVELTPDI HICGICKQQF NNLDAFVAHK QSGCQIiTGTS 
AAAPSTVQFV SEETVPATQT QTTTRTITSB TQTITVSAPE FVFEHGYQTY IiPTBSNENQT 
ATVISLPAKS RTKKPTTPPA QKRLNCCypG CQPKTAYGMK DMBRHLKIHT GDKPHKCEVC 
GKCFSRKDKL KTHMRCHTGV KPYKCKTCDY AAADSSSLNK HLRIHSDERP PKCQICPYAS 
RNSSQLTVHL RSHTGDAPFQ CWLCSAKFKI SSDLKRHMRV HSGEKPFKCE PCNVRCTMKG 
NLKSHIRIKH SGNNFKCPHC DFLGDSKATL RKHSRVHQSB HPEKCSECSY SCSSKAALRI 
HERIHCTDRP FKCNYCSFPT KQPSNLSKHM KKPHGDMVKT EALERKDTGR QSSRQVAKLD 
AKKSPHCDIC DASFMREDSL RSHKRQHSEY NESKWSDVTV LQFQIDPSKQ PATPLTVGHL 
QVPLQPSQVP QFSEGRVKII VGHQVPQANT IVQAAAAAVN IVPPALVAQN PEELPGNSRIi 
QILRQVSLtA PPQSSRCPSE AGAMTQPAVL LTTHEQTDGA TLHQTIiIPTA SGGPQEGSGN 
QTPITSSGIT CTDFEGIiMAL IQEGTAEVTV VSDGGQNIAV ATTAPPVFSS SSQQELPKQT 
YSIIQGAAHP AIiLCPADSIP D. 


Figure 51 



MNASSBGESF AGSVQIPGGT TVLVBLTPDI HICGICKQQF NNLDAFVAHK QSGCQIiTGTS 
AAAPSTVQPV SEETVPATQT QTTTRTITSB TQTITOCQPK TAYGMKDMER HLKIHTGDKP 
HKCEVCGKCF SRKDKLKTHM RCHTGVKPYK CICTC:DYAAAD SSSLNKHLRI HSDERPFKCQ 
ICPYASRNSS QLTVHLRSHT GDAPFQCWIiC SAKFKISSDL KRHMRVHSGB KPFKCEFCNV 
RCTMKGNLKS HIRIKHSGNN PKCPHCDFLG DSKATLRKHS RVHQSEHPEK CSECSYSCSS 
KAAliRIHERI HCTDRPFKCN YCSFDTKQPS NLSKHMKKFH GDMVKTEALE RKDTGRQSSR 
QVAKLDAKBCS FHCDICDASF MREDSIiRSHK RQHSEYSESK NSDVTVLQFQ IDPSKQPATP 
LTVGHIiQVPI. QPSQVPQFSE GRVKIIVGHQ VPQANTIVQA AAAAVWIVPP ALVAQNPEEL 
PGNSRLQILR QVSLIAPPQS SRCPSEAGAM TQPAVIiLTTH EQTDGATXiHQ TLIPTASGGP 
QEQSGNQTPI TSSGITCTDF EGLNALIQEG TASVTWSDG GQNIAVATTA PPVFSSSSQQ 
BLPKQTYSII QGAAHPALLC PADSIPD 

Figure 52



MNASSBGESF AGSVQIPGGT TVLVELTPDI HICGICKQQF NNLDAFVAHK QSGCQLTGTS 
AAAPSTVQFV SEETVPATQT QTTTRTITSB TQTITVSAPE FVFEHGYQTY LPTESNENQT 
ATVISLPAKS RTKKPTTPPA QKRUffCCYPG CQPKTAYGMK DMERHLKIHT GDICT»HKCEVC 
GKCFSRKDKL KTHMRCHTGV KPYKCKTCDY AAADSSSIiNK HLRIHSDERP FKCQICPYAS 
RIsrSSQLTVHL RSHTGDAPFQ CWLCSAKFKI SSDXiKRHMRV HSGEKPFKCE FCNVRCTMK6 
iniiKSHIRIKH SGmWKCmC DPLGDSKATIi RKHSRVHQS6 HPEKCSECSY SCSSKAALRI 
HERIHCTDRP FKCNYCSPDT KQPSNLSKHM KKFHGDMVKT EALERKDTGR QSSRQVAKLD 
AKKSPHCDIC DASFMREDSL RSHKRQHSEY SESKNSDVTV LQFQIDPSKQ PATPLTVGHL 
QVPLQPSQVP QFSEGRVKII VGHQVPQANT IVQAAAAAVN IVPPALVAQN PEBLPGNSRL 
QILRQVSLIA PPQSSRCPSE AGAMTQPAVL LTTHEQTDGA TLHQTLIPTA SGGPQSGSGN 
QTFITSSGIT CTDFEGIjNAL IQEGTAEVTV VSDGGQNIAV ATTAPPVFSS SSQQELPKQT 
YSIIQGAAHP ALLCPADSXP D 



Figure 53 



46/91 



wo 2004/061616 



PCT/US2003/O41613 



C 



80 



Ortholog identification module 



eQTL identification module 



cQTL identification module 



Determination module 



Classification module 



5402 
5404 
5406 
5408 
5410 



FIG. 54 

47/91 



wo 2004/061616 PCT/US2003/041613 






48/91 



wo 2004/061616 PCTAJS2003/041613 








49/91 



wo 2004/061616 



PCT/US2003/041613 




50/91 



AVO 2004/061616 



PCT/US2003/041613 




51/91 



wo 2004/061616 



PCT/US2003/041613 




wo 2004/061616 PCTAJS2003/041613 




53/91 



wo 2004/061616 PCTAJS2003/041613 







54/91 



wo 2004/061616 



PCT/US2003/041613 




55/91 






wo 2004/061616 PCT/US2003/041613 





56/91 



wo 2004/061616 



PCT/US2003/041613 




57/91 



wo 2004/061616 PCT/US2003/041613 







58/91 



wo 2004/061616 



PCTAJS2003/041613 







59/91. 



wo 2004/061616 



PCT/US2003/0416I3 



< 





60/91 



wo 2004/061616 



PCT/US2003/041613 






61/91 



wo 2004/061616 



PCTAJS2003/a41613 



6902 



Therapeutic Area / Disease 



6906 



Identify human population, collect family-based
sample with disease-related phenotypes,
relevant tissue(s) for expression profiling
and genome-wide genotyping data

P, E, G




6904 



Identify inbred strains discordant for
phenotype of interest; construct, phenotype,
and genotype genetic cross; collect relevant
tissue(s) for expression profiling
P, E, G



Profile individuals to identify a 
disease associated pattern 
P + E => DAP






Evaluate underlying genetics of 
genes involved in pattern 
DAP + G 







6908 



6910 



6914 



6912 



Intersect genetics of pattern with
genetics of disease-related traits to
identify key drivers
P + G ∩ DAP + G => D




6916 



Validate D through association study in an
independent sample




Validate D through advanced crosses,
congenic strains or similar model systems



6918 



Use synteny to inform selection of 
targets 



Identify subset of druggable targets 






6920 



6922 



Validate targets via knock-out 

models and/or RNAi techniques 



Fig. 69 



62/91 



wo 2004/061616 PCTAJS2003/041613 



LOD Score Plots with Normal Gene X Expression




LOD score curve for gene X 

LOD score curves for genes Y1,
Y2, Y3, and Y4 attributed to QTL
(given by gene X)

Physical location of gene X 



LOD Score Plots with Gene X RNAi Knockout




LOD score curves for genes X, Y1,
Y2, Y3, and Y4 after siRNA knockdown
of gene X



Fig. 70 

63/91 



wo 2004/061616 



PCTAJS2003/041613 



7102

Select a trait; optionally, expose a portion of a
plurality of organisms to a perturbation that affects the trait

7104

Measure gene expression / cellular constituent level data 50
in the secondary tissue of a plurality of organisms 46

7106

Transform gene expression / cellular constituent level data
50 into expression statistics

7150

Measure one or more phenotypes for all or a portion of the
organisms 46 in the plurality of organisms

7152

Classify the plurality of organisms into distinct phenotypic
groups based on the phenotypes exhibited by the organisms

7154

Identify the phenotypic extremes for the subpopulation with
respect to the trait under study or a phenotype related to the
trait under study

7156

Filter the cellular constituent data to identify which cellular
constituents discriminate the organisms into the phenotypic
extremes identified in step 7154 (e.g., application of a t-test)

7158

Optionally, reduce the number of cellular constituents from
step 7156 using a reducing algorithm (e.g., stepwise
regression, principal component analysis, a stochastic
search, etc.)

7160

Optionally, cluster (e.g., k-means clustering) cellular
constituents from step 7158 (or step 7156) to identify further
subgroups within each phenotypic subpopulation.



7164, Fig. 71B



FIG. 71A 
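
Step 7156 of FIG. 71A mentions a t-test as one way to filter cellular constituents that discriminate the phenotypic extremes. The following is a minimal sketch of that idea, not the method of the application: it assumes Welch's unequal-variance t statistic, a hypothetical cutoff on |t| (`t_cutoff`), and made-up constituent names and levels. It also assumes each group has at least two organisms and non-zero variance.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def filter_discriminators(levels, low_group, high_group, t_cutoff=4.0):
    """Keep constituents whose levels separate the two phenotypic extremes.

    levels: constituent name -> {organism id -> level}
    low_group, high_group: organism ids at each phenotypic extreme
    t_cutoff: hypothetical threshold on |t| (an assumption, not from the source)
    """
    kept = []
    for name, by_org in levels.items():
        low = [by_org[o] for o in low_group]
        high = [by_org[o] for o in high_group]
        if abs(welch_t(low, high)) >= t_cutoff:
            kept.append(name)
    return kept


# Illustrative data only: gene_a differs between the extremes, gene_b does not.
levels = {
    "gene_a": {"o1": 1.0, "o2": 1.2, "o3": 0.9, "o4": 3.1, "o5": 2.9, "o6": 3.3},
    "gene_b": {"o1": 2.0, "o2": 2.4, "o3": 1.8, "o4": 2.1, "o5": 1.9, "o6": 2.3},
}
selected = filter_discriminators(levels, ["o1", "o2", "o3"], ["o4", "o5", "o6"])
```

A fixed cutoff on |t| keeps the sketch dependency-free; a p-value threshold with multiple-testing correction would be the more usual choice in practice.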



64/91 



wo 2004/061616 



PCTAJS2003/041613 



7160, Fig. 71A 

7164 



Use the set of cellular constituents identified as
discriminators between phenotypic extremes to build a
classifier

7166

Use the classifier to classify all or a substantial portion of the
organisms in the population under study, thereby further
refining the definition of the trait under study

7168



Perform quantitative genetic analysis on each subgroup of 
organisms defined by the classifier developed in step 7166 



FIG. 71B 
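
Steps 7164 and 7166 of FIG. 71B build a classifier from the discriminating constituents and apply it to the rest of the population. The figure does not name a classifier, so the sketch below assumes a nearest-centroid rule purely for illustration; the function names, constituent names, and levels are hypothetical.

```python
from statistics import mean


def build_centroids(levels, groups, features):
    """Mean level of each selected constituent within each phenotypic extreme."""
    return {
        label: [mean(levels[f][o] for o in members) for f in features]
        for label, members in groups.items()
    }


def classify(levels, centroids, features, organism):
    """Assign an organism to the group whose centroid is closest."""
    point = [levels[f][organism] for f in features]

    def sqdist(centroid):
        return sum((p - c) ** 2 for p, c in zip(point, centroid))

    return min(centroids, key=lambda label: sqdist(centroids[label]))


# Illustrative data only: o1, o2 and o5, o6 are the extremes; o3, o4 are unlabeled.
levels = {"gene_a": {"o1": 1.0, "o2": 1.2, "o3": 1.4, "o4": 2.6, "o5": 2.9, "o6": 3.3}}
features = ["gene_a"]
centroids = build_centroids(levels, {"low": ["o1", "o2"], "high": ["o5", "o6"]}, features)
labels = {o: classify(levels, centroids, features, o) for o in ["o3", "o4"]}
```

The resulting labels define the subgroups on which the quantitative genetic analysis of step 7168 would then be run.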

65/91 



wo 2004/061616 



PCTAJS2003/041613 





                 Phenotype 1      ...  Phenotype M      CC 48-1       ...  CC 48-Z
Organism 46-1    Amount 7201-1-1  ...  Amount 7201-1-M  Level 50-1-1  ...  Level 50-1-Z
Organism 46-2    Amount 7201-2-1  ...  Amount 7201-2-M  Level 50-2-1  ...  Level 50-2-Z
...
Organism 46-N    Amount 7201-N-1  ...  Amount 7201-N-M  Level 50-N-1  ...  Level 50-N-Z



FIG. 72 

66/91 



wo 2004/061616 



PCT/US2003/O41613 



[Drawing; reference numerals 7152, 7160, 7166, 7170, 7300, 7310-1 to 7310-N, 7320-1 to 7320-M, 7350]



FIG. 73 



67/91 



wo 2004/061616 



PCTAJS2003/041613 



DNA 




Environmental
Contributors



Secondary Clinical
Traits

Co-Morbidities of
the Primary Disease



Fig. 74 

68/91 



wo 2004/061616 



PCT/US2003/041613 











Fig. 75A

69/91 



wo 2004/061616 



PCT/US2003/041613 








70/91 







wo 2004/061616 



PCT/US2003/041613 



A 



Physical location of HSD1 cis-acting eQTL




Fig. 76 





Test for 



Causality 




Fig. 77 



74/91 



wo 2004/061616 



PCT/US2003/041613 





Fig. 78 

75/91 



wo 2004/061616 PCT/US2003/041613 



7902 



Genotype a population under study and, optionally, use pedigree
information for the population

7904

Phenotype the population with respect to a trait or traits of interest
and map quantitative trait loci (cQTL) for each phenotype, resulting
in a set of cQTL linked to the trait

7906

Obtain abundance data for a plurality of cellular constituents from
one or more tissues in each member of the population under study

7908

Identify cellular constituents (association set D) whose abundance
levels across the population significantly associate with the trait of
interest (e.g., by use of Pearson correlations, basic discriminant
analysis, regression models, etc.)

7910

For each cellular constituent i in association set D, perform
quantitative genetic analysis, in which abundance levels of cellular
constituent i across the population serve as a quantitative trait, in
order to identify eQTL for cellular constituent i

7912

Remove all cellular constituents from association set D that do not
have at least one eQTL that is coincident with (within a support
interval of) a cQTL for the trait of interest in order to form the
candidate causative cellular constituent set. Optionally, require
that all coincident eQTL/cQTL pass a pleiotropy test in order to be
considered coincident. Cellular constituents removed from
association set D form a candidate reactive cellular constituent
set.




FIG. 79A 
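
Step 7912 of FIG. 79A splits association set D according to whether a constituent has an eQTL coincident with a cQTL. The sketch below shows one way this split could look, under assumptions that are not in the figure: each QTL is represented as a hypothetical `(chromosome, start, end)` support interval, "coincident" is taken to mean that the intervals intersect, and the optional pleiotropy test is omitted. All names and positions are invented for illustration.

```python
def overlaps(a, b):
    """True if two support intervals lie on the same chromosome and intersect."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def split_candidates(eqtl_by_constituent, cqtl):
    """Partition association set D into candidate causative and reactive sets.

    eqtl_by_constituent: constituent name -> list of (chrom, start, end) eQTL
    cqtl: list of (chrom, start, end) cQTL for the trait of interest
    """
    causative, reactive = [], []
    for name, eqtls in eqtl_by_constituent.items():
        coincident = any(overlaps(e, c) for e in eqtls for c in cqtl)
        (causative if coincident else reactive).append(name)
    return causative, reactive


# Illustrative intervals only (arbitrary map units).
cqtl = [("chr2", 40.0, 55.0)]
eqtl_by_constituent = {
    "gene_a": [("chr2", 50.0, 60.0)],  # intersects the cQTL
    "gene_b": [("chr7", 10.0, 20.0)],  # different chromosome
    "gene_c": [("chr2", 56.0, 70.0)],  # same chromosome, no intersection
}
causative, reactive = split_candidates(eqtl_by_constituent, cqtl)
```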

76/91 



wo 2004/061616 PCT/US2003/041613 




7916 



For each cellular constituent i in the candidate causative cellular
constituent set, determine the amount of genetic variation in the
trait of interest that is explained by the eQTL of cellular constituent
i coincident with the cQTL from the trait of interest. Rank order the
cellular constituents in the candidate causative cellular constituent
set based upon the amount of genetic variation in the trait of
interest that is explained by each cellular constituent determined in
this manner.

7918

For each eQTL of each cellular constituent i in the candidate
causative cellular constituent set, test for the relationship:

P(T,Q|G) = P(T|G)P(Q|G)

where,

T is variance in the trait of interest,

Q is variance in the genome at the position where the eQTL
(associated with cellular constituent i) overlaps with a cQTL linked
to the trait of interest, and

G is variance in the abundance level of cellular constituent i



7920



Optionally, determine whether each cellular constituent i in the
candidate causative cellular constituent set includes a druggable
domain



7922



Optionally, rank cellular constituents in the candidate causative
cellular constituent set based on the rank assigned in step 7916 and
the results of step 7918 and/or step 7920
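
The box does not say how the three inputs are combined, so the following is only one possible reading, sketched for illustration: constituents that pass the test of step 7918 come first, then those with a druggable domain (step 7920), with the variance-based rank of step 7916 breaking ties. The record layout and field names are hypothetical.

```python
def combined_rank(candidates):
    """Order candidates by test result, druggability, then prior rank.

    candidates: list of dicts with the hypothetical keys
      name        - constituent identifier
      rank        - rank from the variance-explained ordering (1 = best)
      independent - True if the conditional independence test passed
      druggable   - True if a druggable domain was found
    """
    ordered = sorted(
        candidates,
        key=lambda c: (not c["independent"], not c["druggable"], c["rank"]),
    )
    return [c["name"] for c in ordered]

candidates = [
    {"name": "geneA", "rank": 1, "independent": False, "druggable": True},
    {"name": "geneB", "rank": 2, "independent": True, "druggable": False},
    {"name": "geneC", "rank": 3, "independent": True, "druggable": True},
]
final_order = combined_rank(candidates)
```

Other weightings are equally compatible with the text, since the step allows the results of step 7918 and step 7920 to be used together or separately.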



7924



Optionally, validate top ranking cellular constituents using gene 
knock outs/ins, transgenic construction, siRNA, drug treatments 
targeting candidate genes, time series experiments, etc. 



FIG. 79B 




Phenotypic statistic set for clinical trait 1 (8000-1)
    Phenotypic value for organism 1 (8004-1-1)
    Phenotypic value for organism 2
    Phenotypic value for organism 3 (8004-1-3)
    ...
    Phenotypic value for organism Q (8004-1-Q)
...
Phenotypic statistic set for clinical trait Z (8000-Z)
    Phenotypic value for organism 1 (8004-Z-1)
    Phenotypic value for organism 2 (8004-Z-2)
    Phenotypic value for organism 3 (8004-Z-3)
    ...
    Phenotypic value for organism Q (8004-Z-Q)



FIG. 80 




Abundance / genotype warehouse
    Cellular constituent 1 (8102-1)
        Abundance statistic set 1 (8104-1)
            Organism 1 (8106-1-1)
                Abundance statistic 1 (8108-1-1)
            Organism 2 (8106-1-2)
                Abundance statistic 2 (8108-1-2)
            ...
            Organism N (8106-1-N)
                Abundance statistic N (8108-1-N)
    ...
    Cellular constituent M (8102-M)
        Abundance statistic set M (8104-M)
            Organism 1 (8106-M-1)
                Abundance statistic 1 (8108-M-1)
            Organism 2 (8106-M-2)
                Abundance statistic 2 (8108-M-2)
            ...
            Organism N (8106-M-N)
                Abundance statistic N (8108-M-N)



FIG. 81 
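
The layout of FIG. 81 can be mirrored by a small in-memory structure. The sketch below is an assumption about how such a warehouse might be modeled in code, not a description of the applicants' implementation; class, method, and key names are hypothetical, and the reference numerals appear in comments only to show the correspondence.

```python
from dataclasses import dataclass, field

@dataclass
class AbundanceStatisticSet:
    """Abundance statistics for one cellular constituent (cf. 8104)."""
    # organism identifier (cf. 8106) -> abundance statistic (cf. 8108)
    by_organism: dict = field(default_factory=dict)

@dataclass
class Warehouse:
    """Abundance warehouse keyed by cellular constituent (cf. 8102)."""
    constituents: dict = field(default_factory=dict)

    def add(self, constituent, organism, value):
        """Record one abundance statistic for a constituent and organism."""
        stat_set = self.constituents.setdefault(
            constituent, AbundanceStatisticSet())
        stat_set.by_organism[organism] = value

    def vector(self, constituent, organisms):
        """Abundance of one constituent across organisms, in given order."""
        stat_set = self.constituents[constituent]
        return [stat_set.by_organism[o] for o in organisms]

warehouse = Warehouse()
warehouse.add("geneG", "org1", 0.7)
warehouse.add("geneG", "org2", -1.2)
warehouse.add("geneH", "org1", 0.1)
gene_g = warehouse.vector("geneG", ["org1", "org2"])
```

The `vector` call returns the per-organism abundance values of a single constituent, which is the form needed when the abundance level is treated as a quantitative trait.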




8104-G
    Abundance statistic for gene G from organism 1 (8108-G-1)
    Abundance statistic for gene G from organism 2 (8108-G-2)
    Abundance statistic for gene G from organism 3 (8108-G-3)
    Abundance statistic for gene G from organism 4 (8108-G-4)
    ...
    Abundance statistic for gene G from organism N (8108-G-N)



FIG. 82 




Abundance / genotype warehouse
    Cellular constituent 1
        Abundance statistic set 1 (8104-1)
            Organism 1 (8106-1-1)
                Abundance statistic 1 / Tissue a (8108-1-1-a)
                Abundance statistic 1 / Tissue b (8108-1-1-b)
                ...
            ...
            Organism N (8106-1-N)
                Abundance statistic N / Tissue a (8108-1-N-a)
                Abundance statistic N / Tissue b (8108-1-N-b)
    ...
    Cellular constituent M (8102-M)
        Abundance statistic set M (8104-M)
            Organism 1 (8106-M-1)
                Abundance statistic 1 / Tissue a (8108-M-1-a)
                Abundance statistic 1 / Tissue b (8108-M-1-b)
            ...
            Organism N (8106-M-N)
                Abundance statistic N / Tissue a (8108-M-N-a)
                Abundance statistic N / Tissue b (8108-M-N-b)



FIG. 83 






Abundance statistic set (8104-1)
    Position 1 (8404-1-1)
        Statistical score (8406-1-1)
    Position 2 (8404-1-2)
        Statistical score (8406-1-2)
    ...
    Position X (8404-1-X)
        Statistical score (8406-1-X)
...
Abundance statistic set (8104-M)
    Position 1 (8404-M-1)
        Statistical score (8406-M-1)
    Position 2 (8404-M-2)
        Statistical score (8406-M-2)
    ...
    Position X (8404-M-X)
        Statistical score (8406-M-X)


• 

- 







FIG. 84 
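
Given a list of positions with a statistical score at each, as laid out in FIG. 84, a QTL peak and its support interval can be located with a simple scan. The sketch assumes the score is a lod score, a significance threshold of 3.0, and a one-lod-drop support interval; these are common conventions adopted here as assumptions, and the function name and data are hypothetical.

```python
def peak_with_support_interval(positions, scores, threshold=3.0, drop=1.0):
    """Locate the highest-scoring position and its support interval.

    Returns (peak_position, interval_start, interval_end), or None when
    no score reaches the threshold. The interval extends outward from
    the peak while the score stays within `drop` of the peak score.
    """
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] < threshold:
        return None
    cutoff = scores[best] - drop
    left = best
    while left > 0 and scores[left - 1] >= cutoff:
        left -= 1
    right = best
    while right < len(scores) - 1 and scores[right + 1] >= cutoff:
        right += 1
    return positions[best], positions[left], positions[right]

# Toy scan: positions in cM along one chromosome with a score at each.
positions = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0]
scores = [0.5, 2.1, 4.2, 5.0, 4.4, 1.0]
result = peak_with_support_interval(positions, scores)
```

Two such intervals, one from an expression trait and one from a clinical trait, can then be compared for overlap to decide whether an eQTL and a cQTL are coincident.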




[FIG. 85A, FIG. 85B and FIG. 85C: diagrams not recoverable from the extracted text; surviving labels are (i), (ii), (iii) and Q]






Q → G → T

FIG. 85D







Lod Score Curve for INS Chromosome 13 QTL
[Plot not reproduced; x-axis: Chromosome 13 (cM)]

Fig. 86A



Lod Score Curve for EPIPA Chromosome 13 QTL
[Plot not reproduced; x-axis: Chromosome 13 (cM)]

Fig. 86B






Lod Score Curve for LEP Chromosome 13 QTL
[Plot not reproduced; x-axis: Chromosome 13 (cM)]

Fig. 86C



Lod Score Curve for CHDL Chromosome 13 QTL
[Plot not reproduced; x-axis: Chromosome 13 (cM)]

Fig. 86D






1 mdwdsllvn gsnitppcel glenetlfcl dqprpskewq pavqillysl ifllsvlgnt

61 lvitvlirnk rmrtvtnifl lslavsdlml clfctnpfnli pnllkdfifg savcktttyf

121 mgtsvsvstf nlvaislery gaickplqsr vwqtkshalk viaatwclsf timtpypiys

181 nlvpftkniin qtanmcrfll pndvmqqswh tflllilfli pgivmmyayg lislelyqgi

241 kfeasqkksa kerkpsttss gkyedsdgcy lqktrpprkl elrqlstgss sranrirsns

301 saanlmakkr virmlivivv lfflcwmpif sanawraydt asaerrlsgt pisfilllsy

361 tsscvnpiiy cfnmkrfrlg fmatfpccpn pgppgargev geeeeggttg aslsrfsysh

421 msasvppq (SEQ ID NO: 30)



Fig. 88 






